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Abstract 

The properties of the normal distribution under linear transformation, as well the 
easy way to compute the covariance matrix of marginals and conditionals, offer a unique 
opportunity to get an insight about several aspects of uncertainties in measurements. 
The way to build the overall covariance matrix in a few, but conceptually relevant cases 
is illustrated: several observations made with (possibly) different instruments measuring 
the same quantity; effect of systematics (although limited to offset, in order to stick to 
linear models) on the determination of the ‘true value’, as well in the prediction of future 
observations; correlations which arise when different quantities are measured with the 
same instrument affected by an offset uncertainty; inferences and predictions based 
on averages; inference about constrained values; fits under some assumptions (linear 
models with known standard deviations). Many numerical examples are provided, 
exploiting the ability of the R language to handle large matrices and to produce high 
quality plots. Some of the results are framed in the general problem of ‘propagation of 
evidence’, crucial in analyzing graphical models of knowledge. 


“So far as the theories of mathematics are about reality, they are not certain; 

so far as they are certain, they are not about reality. 

(A. Einstein) 

“If we were not ignorant there would be no probability, 

there could only be certainty. 
But our ignorance cannot be absolute, 
for then there would be no longer any probability at all.” 

(H. Poincare) 

“Probability is good sense reduced to a calculus” 

(S. Laplace) 


Note based on lectures to PhD students in Rome. 


‘All models are wrong but some are useful” 

(G. Box) 



1 Introduction 


The opening quotes set up the frame in which this paper has been written: in the sciences 
we always deal with uncertainties; being in condition on uncertainty we can only state 
‘somehow’ how much we believe something; in order to do that we need to build up prob¬ 
abilistic models based on good sense. For example, if we are uncertain about the value we 
are going to read on an instrument, we can make probabilistic assessments about it. But 
in general our interest is the numerical value of a physies quantity. We are usually in great 
condition of uncertainty before the measurement, but we still remain with some degree of 
uncertainty after the measurement has been performed. Models enter in the construction 
of the the causal network which connects physics quantities to what we can observe on 
the instruments. They are also important because it is convenient to use, whenever it is 
possible, probability distributions, instead than to assign individual probabilities to each 
individual ‘value’ (after suitable discretization) that a physics quantity might assume. 

As we know, there are good reasons why in many cases the Gaussian distribution (or 
normal distribution) offers a reasonable and eonvenient description of the probability that 
the quantity of interest lies within some bounds. But it is important to remember that, 
as it was clear to Gauss [T] when he derived the famous distribution for the measurement 
errors, one should not take literally the fact that the variable appearing in the formula can 
range from minus infinite to plus infinite: an apple cannot have infinite mass, or a negative 
one! 

Sticking hereafter to Gaussian distributions, it is clear that if we are only interested to 
the probability density function (pdf) of a variable at the time, we can only describe our 
uncertainty about that quantity, and nothing more. The game becomes interesting when we 
study the joint distribution of several variables, because this is the way we can learn about 
some of them assuming the values of the others. For example, if we assume the joint pdf 
f{xi,X2 1 1 ) of variables Xi and X2 under the state of information I (on which we ground 
our assumptions), we can evaluate f{xi \ X2 ,/), that is the pdf adding the extra condition 
X2 = X2, which is usually not the same as /(xi 1 1 ), that is the pdf of Xi for any value X2 
might assume 0 

Let us take for example the three diagrams of Fig. [T]to which we give a physical inter¬ 
pretation: 

1 . In the diagram on the left the variable Xi might represent the numerical value of a 
physics quantity, on which we are in condition on uncertainty, modelled by 


Ai ~AA(Ao,cti), (1) 

where Xq and cii are suitable parameters to state our ‘ignorance’ about Xi (‘complete 
ignorance’, if it does ever exist, is recovered in the limit ai —>■ 00). Instead, X2 is 
then what we read on an instrument when we apply it to Xi. That is, even if we 
knew Xi, we are still uncertain about what we can read on the instrument, as it is 

^The pdf f{xi I 7) is called marginal, although there is never special about this name, since all distri¬ 
butions of a single variable can be thought as being ‘marginal’ to all other possible quantities which we 
are not interested about. f(xi 1*2,7) is instead ‘called’ conditional, although it is a matter of fact that all 
distributions are conditional to a given state of information, here indicated by 7. Note that throughout this 
paper will shall use the same symbol /() for all pdf’s, as it is customary among physicists - I have met 
mathematics oriented guys getting mad by the equation f(x,y) = f{x\y) ■ f{y) because, they say, “the three 
functions cannot be the same”... 
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Figure 1 : Basic models of joint probabilities 


well understood. Modelling this uncertainty by a normal distribution we have, for 
any value ol Xi 


X2|^^~AA(Xi,a2|i), (2) 

where cj2|i is a compact symbol for a{X2\x^) and which is in general different from 
CJ2 = a{X2). In fact our uncertainty about X2 (for any possible value of Xi ) must be 
larger than that about Xi itself, for obvious reasons - we shall see later the details. 

2 . In the diagram on the center X^ might represent a second observation done indepen¬ 
dently applying in general a second (possibly different) instrument to the identical 
value Xi. This means that ^ 2 lxi and X^l^^ are independent, although X2 and X3 
are not , as we shall see. 

3 . In the diagram on the right X3 is the observation read on the instrument applies to 
Xi, but possibly influenced by X2, that might then represent a kind of systematics. 

Note, how it has been precisely stated, that X2 of the first and of the second diagrams, 
as well as X3 of the other two, are the readings on the instruments and not the result of 
the measurement! This is because by “result of the measurement” we mean statements 
about the quantity of interest and not about the quantities read on the instruments (think 
for example at the an experiment measuring the Higgs boson mass, making use of the 
information recorded by the detector!). In this case the “result of the measurement” would 
be f{xi I data,/) where data stands for the set of observed variables. 

The diagrams of the figure can be complicated, using sets of data, with systematics 
effects common to observations in each subset. The aim of this paper is to help in developing 
some intuition of what is going on in problems of this kind, with the only simplification 
that all pdf’s of interest are normal. 



2 Technical premises (with some exercises) 


We assume that the reader is familiar with some basic concepts related to uncertain numbers 
and uncertain vectors, usually met under the name of “random variables”. 


2.1 Normal (Gaussian) distribution 

X ~ AA(/i, a): 


f{x\M{^i,a)) - exp ^ „ 2^ 

V 2 vr (T L ^ 

( 3 ) 

E[X] = 

( 4 ) 

VarfA] = 

( 5 ) 

a[X] = ^JXai[X] = a. 

( 6 ) 


(We remind that in most physics applications x —>■ ±00 simply means \x — /r|/(T 3 > 1 .) 

In the R language [2] there are functions (dnormO, pnormO and qnormO, respectively) 
to calculate the pdf, the cumulative function, usually indicated with “F(x)”, as well as its 
inverse, as shown in the following, self explaining example^ (‘>’ is the R console prompt): 

> dnorm(0, 0, 1) 

[1] 0.3989423 

> l/sqrt(2*pi) # (just a check) 

[1] 0.3989423 

> pnormCO, 0, 1) 

[1] 0.5 

> pnorni(7, 5, 2) - pnorm(3, 5, 2) 

[1] 0.6826895 

> qnorni(0.5, 5, 2) 

[1] 5 

> qnormd, 5, 2) 

[1] Inf 

> qnormCO, 5, 2) 

[1] -Inf 

Note the capability of the language to handle infinities, as it can be cross checked by 

> pnormClnf, 5, 2) 

[ 1 ] 1 

And here are the instructions to produce the plots of figure [2l 

mu <- 5; sigma <- 2; x <- seq(mu-5*sigma, mu+5*sigma, len=101) 
plot(x, dnormCx, mu, sigma), ty=’l’, ylab=’f(x)', col=’blue’) 
points(x, dnorm(x, mu, sigma*!.5), ty=’l’, lty=2, col=’blue’) 
pointsCx, dnorm(x, mu, sigma*2), ty=’l’, lty=3, col=’blue’) 
plot(x, pnormCx, mu, sigma), ty=’l’, ylab=’F(x)', col=’red’) 
pointsCx, pnorm(x, mu, sigma*!.5), ty=’l’, lty=2, col=’red’) 
pointsCx, pnormCx, mu, sigma*2), ty=’l’, lty=3, col=’red’) 


^For information about the language see one of the many tutorial available on the web. Most functions 
we shall use here have self explaining names. For an help, for example about dnormO, just enter 
> ?dnorm 
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X 


Figure 2: Gaussian probability density function (above) and cumulative function (below) 
for /i = 5 and ct = 2, 3 and 4 (solid, dashed and pointed). 


2.2 Bivariate and multivariate normal distribution 

The joint distribution of a bivariate normal distribntion is given by 




1 


2 TT fJi fT2 y/l- PI 2 


exp 


1 




(xi - Plf 


O {xi - Pl){x2 - P 2 ) , {X2-P2f'' 
-2/912-^-o- 


where 


a 


(Tl (T2 (T 2 

X = {XI,X2) 

M = (/^l,/^2) 

E[Xi] = Pi 
Var[Xj] = af 
[X,] = ^\sx[Xi] = ai 

Cov[Xi,X2] 


P12 = 


(Tl (T2 


with variances and covariances forming the covariance matrix 


V = 


Var[Xi] Cov[Xi,X2] 
Cov[Xi,X2] Var[X2] 


(Tt 


Pl2Cri (T2 


PI 2 CTi (72 




(7) 


( 8 ) 

(9) 

( 10 ) 

( 11 ) 

( 12 ) 

(13) 


(14) 
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The bivariate pdf ([7]) can be rewritten in a compact form as 




(27r)-'^/2|y|-i/2 



nf V-^ {x 



(15) 


where | V| stands for det(l^). This expression is valid for any number n of variables and it 
turns, in the case V is diagonal, into 


1 

(27r)’*/2 Hi CTi 


exp 


- 2 


(16) 


(For an extensive, although mathematically oriented treatise on multivariate distribution 
see Ref. [3], freely available online.) 


2.2.1 Multivariate normals in R 

Functions to calculate multivariate normal pdf’s, as well as cumulative functions and ran¬ 
dom generators are provided in R via the package mnormtl^ that needs first to be installecfl 
issuing 

> install.packages("mnormt") 
and then loaded by the command 

> library(mnormt) 

Then we have to define the values of the parameters and built up the vector of the central 
values and the covariance matrix. Here is an example: 

> ml=0.4; m2=2; sl=l; s2=0.5; rho=0.6 

> mu <- c(ml, m2) 

> ( V <- rbind( c( sl''2, rho*sl*s2), c(rho*sl*s2, s2''2) ) ) 

[,1] [,2] 

[1,] 1.0 0.30 
[2,] 0.3 0.25 

Then we can evaluate the joint pdf in a point {xi,X 2 ), e.g. 

> dmnorm(c(0.5, 1.5), mu, V) 

[1] 0.1645734 

Or we can evaluate P{Xi < 0.5 X X 2 < 1.5), or P{Xi < & X 2 < fi 2 ), respectively, with 

> pmnorm(c(0.5, 1.5), mu, V) 

[1] 0.140636 

and 

> pmnorm(mu, mu, V) 

[1] 0.3524164 


2.3 Graphical representation of normal bivariates 

If we like to visualize the joint distribution we need a 3D graphical package, for example 
rgi or plot3Dl^ We need to evaluate the joint pdf on a grid of values ‘r’ and ‘y’ and 

^http://cran.r-project.org/web/packages/mnormt/ 

“^For all technical details about R (open source and multi-platform!) see the R web site [2]. 

®https://r-forge.r-project.org/projects/rgl/ 

® http://www.r-bloggers.com/3d-plots-in-r/ 
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provide them to the suited function. Here are the instructions that use the persp3d() of 
the rgl package: 

> library(rgl) 

> fun <- function(xl ,x2) dmnorin(cbind(xl, x2), mu, V) 

> xl <- seq(ml-3*sl, ml+3*sl, len=51) 

> x2 <- seq(m2-3*s2, m2+3*s2, len=51) 

> f <- outer(xl, x2, fun) 

> persp3d(xl, x2, f, col='cyan', xlab="xl", ylab="x2", zlab="f(xl,y2)") 

After the plot is shown in the graphics window, the window can be enlarged and the plot 
rotated at wish. Figure [3] shows in the upper two plots two views of the same distribution. 

Here are also the instructions to use plot3D(): 

> library(plot3D) 

> M <- mesh(xl, x2) 

> surf3D(M$x, M$y, f, bty=’b2’, phi = 30, theta = -20, 

+ xlab=’xl’, ylab='x2', zlab=’f(xl,x2)’) 

The result is shown in the lower plot of Fig. [3l 

Another convenient and often used representation of normal bivariates is to draw iso- 
pdf contours, i.e. lines in correspondence of the points in the plane (xi,X 2 ) such as 
f{xi,X 2 \I) = const. This requires that the quadratic form at the exponent of Eq. ([7]) 
[that is what is written in general as {x — V~^ (x — /r)] has a fixed value. In the two 

dimensional case of Eq. ©we recognize the expression of an ellipse. We have in R the 
convenient package ellipsqj to evaluate the points of such an ellipse, given the vector of 
expected values, the covariance matrix and the probability that a point falls inside it. Here 
is the script that applies the function to the same bivariate normal of Fig. [3l thus producing 
the contour plots of Fig. [D 

plot( ellipse(V, centre=mu, level=0.9973), ty=’l’, lty=2, col=’red’, 
asp=l, xlab=expression(x[1]), ylab=expression(x [2]) ) 
points( ellipse(V, centre=mu, level=0.99), ty=’l’, col=’blue’) 
points( ellipse(V, centre=mu, level=0.954), ty=’l’, lty=2, col=’red’) 
points( ellipse(V, centre=mu, level=0.5), ty=’l’, col=’blue’) 
points( ellipse(V, centre=mu, level=0.683), ty=’l’, lty=2, col=’red’) 
points( ellipse(V, centre=mu, level=0.90), ty=’l’, col=’blue’) 
points(mu[1], mu[2], pch=3, cex=1.5, col=’blue’) 
for(k in 1:3) { 

abline(v=mu[1]-k*sqrt(V[1,1]), lty=3, col=’magenta’) 
abline(v=mu[1]+k*sqrt(V[1,1]), lty=3, col=’magenta’) 
abline(h=mu[2]-k*sqrt(V[2,2]), lty=3, col=’magenta’) 
abline(h=mu[2]+k*sqrt(V[2,2]), lty=3, col=’magenta’) 

> 

The probability to find a point inside the ellipse contour is defined by the argument level. 
The ellipses drawn with solid lines define, in order of size, 50%, 90% and 99% contours. For 
comparison there are also the contours at 68.3%, 95.5% and 99.73%, which define the highly 
confusing 1-a , 2-a and 3 -it contours. Indeed, the probability that each of the variable falls 
in the interval of E[Xj]±fe c7[Xj] has little to do with these ellipses. If we are interested to the 
probability that a point falls in a rectangles defined by (E[Xi] ± k cr[Xi] &: E[X 2 ] ± k (y[X 2 \) 
the probability needs to be calculated making the integral of the joint distribution inside 
the rectangle (some of these rectangles are shown in Fig. 0] by the dotted lines, that indicate 
1-cr , 2-(T and 3-fT bound in the individual variable). 

’http://cran.r-project.org/web/packages/ellipse/ 
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Figure 3: Three views of the bivariate normal distribution obtained with the R code pro¬ 
vided in the text {fii = 0.4, /i 2 = 2, ui = 0.1, <72 = 0.5, pi 2 = 0.6). The above two are 
obtained by perp3d() of the package rgl, producing interactive 3D plots. The one below is 
produced by surf 3D () of the package plot3D. 














Xi 

Figure 4: Contour plots of the same bivariate normal of Fig. [3l The solid lines show the 
ellipses inside which there is, from the smaller to the larger, 50%, 90% and 99% probability 
that a point (xi,X 2 ) falls inside them. The dashed ellipses define instead the 68.3%, 95.5% 
and 99.73% probability contours [these are the (in-)famous l-ci, 2-a and 3-(T contours, not 
simply related to the standard deviations of the individual variable, whose l-u, 2 -ct and 3-a 
bounds are indicated by the dotted vertical and horizontal lines]. 


Let us see how to evaluate in R the probability that a point falls in a rectangle, making 
use of the cumulative probability function pmnormO. In fact the probability in a rectangle 
is related to the cumulative distribution by the following relation 

P[{xi^<Xi<Xi^)k{x2^<X2<X2j^)] = P[{Xi<Xi^)k{X2<X2j^)] 

-P[{X,<XiJkiX2<X2j] 
-P[{Xl<XiJk{X2<X2j] 
+P[iX,<xiJkiX 2 <X 2 ,)], (17) 

that can be implemented in an R function: 

p.rect.norm <- functionCxlim, ylim, mu, V, sigmas=FALSE, ...) { 

# The argument ’ ’ might be useful to pass extra arguments to pmnorm. 

if ( (length(mu) != 2) I sum( dim(V) != c(2,2) ) # some check 

I (length(xlim) != 2) I (length(ylim) != 2) ) { 
print("wrong dimensions in one of parameters") 
return(NULL) 

}■ else if ( sum( eigen(V)$values <= 0 ) > 0) { 

cat( sprintfC'V is not positively defined\n") ) 
return(NULL) 

} 

# If argument ’sigmas’ is TRUE: 

if( sigmas ) { # rectangular defined in units of individual sigma around mu 
xlim <- mu[l] + xlim * sqrt(V[l,l]) 
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ylim <- mu[2] + ylim * sqrt(V[2,2]) 

} 

library(mnormt) 

p.rect <- pmnormC c(xlim[2], ylim[2]), mu, V, ...) - 

pmnormC c(xlim[2], ylim[l]), mu, V, ...) - 

pmnormC c(xlim[l], ylim[2]), mu, V, ...) + 

pmnormC cCxlim[l], ylim[l]), mu, V, ...) 

returnCp.rect) 

> 

For exampl^ 

> p.rect.normCcCml-sl, ml+sl), cCm2-s2, m2+s2), mu, V) 

[1] 0.5138685 

> p.rect.normCcC-1, 1), cC-1, 1), mu, V, sigmas=TRUE) 

[1] 0.5138685 

As a cross check, let us calculate the probabilities in strips of plus/minus one standard 
deviations around the averages (the ‘strips’ provide a good intuition of what a ‘marginal’ 
is): 

> p.rect.normCcC-1, 1), cC-10, 10), mu, V, sigmas=TRUE) 

[1] 0.6826895 

> p.rect.normCcC-10, 10), cC-1, 1), mu, V, sigmas=TRUE) 

[1] 0.6826895 


2.4 Marginals (and ‘multivariate marginals’) of multivariate normals 

A nice feature of the multivariate normal distribution is that if we are just interested to a 
subset of variables alone, neglecting which value the other ones can take (‘marginalizing’), 
we just drop from and from V the uninteresting values, or the relative rows and columns, 
respectively. For example, if we have - see subsection 16.1.21 - 


M = 


1.96 \ 


/ 1.96 

-0.98 

0.98 

0.02 

V = 

-0.98 

0.99 

0.01 

1.98 / 


\ 0.98 

0.01 

1.99 


(18) 


marginalizing over the second variable (i.e. being only interested in the first and the third) 


we obtain 



/ 1.96 0.98 \ 
V 0.98 1.99 ) 


(19) 


Here is a function that returns expected values and variance of the multivariate ‘marginal’ 


marginal.norm <- functionCmu, V, x.m) { 

# x.m is a vector with logical values Cor non zero) indicating 

# the elements on which to marginalise Cthe others are 0, NA or FALSE) 
x.mfis.naCxm)] <- FALSE 

V <- whichC as.logicalCx.m) ) 
listCmu=mu [v] , V=V[v, v] ) 

} 

®For Monte Carlo oriented guys, here is how to cross check the results (don’t expect to reproduce 51313!): 

> xy <- rmnormC100000, mu, V) 

> length! xy[,l] [ xy[,l] > ml - si & xy[,l] < ml+sl & xy[,2] > m2 - s2 & xy[,2] < m2+s2 ] ) 
[1] 51313 
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(Note how the function has been written in a very compact form, exploiting some peculiar¬ 
ities of the R language. In particular, the elements of x.m to which we are interested can 
be TRUE, or can be a numeric value different from zero; the others can be FALSE, 0 or NA.) 

2.5 Conditional distribution of a variable, given its bivariate distribution 
with another variable 

A different problem is the pdf of one of variables, say Xi, for a given value of the other. 
This is not as straightforward as the marginal (and for this reason in this subsection we only 
consider the bivariate case). Fortunately the distribution is still a Gaussian, with shifted 
central value and squeezed width: 

^l\x2 ~ AA -F Pi2 ^ (X2 - P 2 ), pI^ , 

i.e. 

E[Xi] = Pi+ Pi2 — (X2 - P2) 

<^2 

Var[Ai] = o-f • (1 - PI 2 ) 
cj[Ai] = Ox- \j\- p\^. 

And, by symmetry, 

~ AA (^^2 + P \2 ^ (2:1 - ^1), 0-2^1 -pf2) • 

Mnemonic rules to remember Eqs. (j2ip and ()22p are 

• the shift of the expected value depends linearly on the correlation coefficient as well 
on the difference between the value of the conditionand (^ 2 ) and its expected value 
(/i 2 ); the ratio 01I02 can be seen as a minimal dimensional factor in order to get 
a quantity that has the same dimensions of pi (remember that Xi and X2 have in 
general different physical dimensions); 

• the variance is reduced by a factor which depends on the absolute value of the cor¬ 

relation coefficient, but not on its sign. In particular it goes to zero if |pi 2 | ^ 1, 
limit in which the two quantities become linear dependent, while it does not change 
if P 12 0, since the two variables become independent and they cannot effect each 

other. (In general independence implies p = 0. For the normal bivariate it is also true 
the other way around.) 

An example of a bivariate distribution (from [5], with xi and X 2 indicated as customary 
with X and y) is given in Fig. [5l which shows also the marginals and some conditionals. 


( 20 ) 

( 21 ) 

( 22 ) 

(23) 

(24) 


2.5.1 Evaluation of a conditional from a given bivariate normal 

As an exercise, lets prove (1201) . with the purpose of show some useful tricks to simplify the 
calculations. If we take literally the rule to evaluate f{xi X 2 \ I) knowing that f{xi,X 2 \ I) 
is given by ([7| we need to calculate 


f{xi I X2, I) 


f{xi,X2 1 1) 

/(^ 2 |/) 


(25) 
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The trick is to make the calculations neglecting all irrelevant multiplicative factors, starting 
from the whole denominator f{x 2 \ I), which is a number given X 2 = X 2 (whatever its value 
might be!). 

Here are the details (note that additive terms in the exponential are factors in the 
function of interest!)!^ 


f{xi I X 2 , I) oc f{xi,X 2 1 1) 


oc exp 

oc exp 

oc exp 

oc exp 

oc exp 

oc exp 


1 




(xi-fli)'^ {xi - Hi){x2 - fJ-2) , ix2-H2)'^ 

- — Ipi 2 -h 


CJi C72 


crn 


1 


(xi - Pif -2pi2 —(xi - Pl){x 2 - P 2 ) 

(T 2 

x {-2 piXi + p\-2pi2 — {X2 - P2) Xl 

(^2 

xi- 2 xi [pi + Pi2 — {X 2 - P 2 )\ 

(72 

xf-2xi [pi + pi2 — {X2 - P 2 )] + [Pl + P 12 — {X2 - P2)f 
(72 (72 


2(1-/^?2)^? L 

1 

'2{l-pl^)al [■ 

1 

'2{l-pl^)al L 

1 

'2{l-pl^)al L 
1 

' 2 {l-pl^)al 


Xl - [pi + pi2 — (X2 - P 2 )] 
(72 


(26) 


in which we recognize a Gaussian with expected value pi + pi 2 fj- {x 2 — P 2 ) and standard 
deviation cji 2 (and therefore the normalization factor can be obtained without any 

calculation). 


2.6 Linear combinations 

Linear transformations of variables are important because there are several practical prob¬ 
lems to which they apply. There are also other cases in which the transformation is not 
rigorously linear, but it can be still approximately linearized in the region of interest, where 
the probability mass is concentrated. There are well known theorems that relate expected 
values and covariance matrix of the input quantities to expected values and covariance ma¬ 
trix of the output quantities. The most famous case is when a single output quantity Y 
depends on several variables X. So, given 

y = (27) 

i 

there is a relation which always holds, no matter if the Xi are independent or not and 
whichever are the pdf’s which describe them: 


HY] = 


(28) 


^Essentially the trick consists in observing that if we have a pdf proportional to exp[—h^ {x^ -l-aa;)], 
then it is also proportional to 


exp 




= exp 




that is a Gaussian with u = —q;/ 2 and cr^ = 1/(2 h?). 
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In the special case that the Xj are also independent, we have 

Var[y] = ^c^VarfXi], (29) 

i 

Instead it is not always simple to calculate the pdf of Y in the most general case. There 
are however two remarkable cases, which we assume known and just recall them here, in 
which is Y is normally distributed: 

1. linear combinations of normally distribnted variables are still normal; 

2. the Central Limit Theorem states that if we have ‘many0 independent vari¬ 
ables their linear combination is normally distributed with variance equal to c? Var[Xj] 
if none of the non-normal components dominates the overall variance, i.e. if Cj Var[Xj] ^ 

cf Var[Xj], where j denotes any of those non-normal components. 

Since in this paper we only stick to normal pdf’s, the only task will be to evaluate the 
covariance matrix of the set of variables of interest, depending on the problem. 

The general transformation from n input variables to m output variable is given b£i| 

Yi = djXj , (30) 

or, in a compact form that use the transformation matrix C, whose elements are the Cjj, 


transformation rule is given by 


i.e. 


Y 

= cx. 


(31) 

matrix of the output quantities are 

given by 


E[y] 

= CY[X] 


(32) 

Vy 

= CVxC^ 


(33) 

, with (TXi 

= 0.2, ax 2 = 0-5 and 

pXi2 0.8, 

and the 

= 

X 1 + 2 X 2 


(34) 

>2 = 

-Xi + X 2 , 


(35) 

c = 

n 


(36) 


we get in R: _ 

^°The theorem says “for n that goes to infinity”! Some practice is then needed to judge when it is large 
enough - often n around 10 is can be considered ‘large’, in other cases even 10® is not enough! (Think of 
one million of variables described by a Poisson distribution with A = 10~®.) 

^^We neglect a possible extra constant term in the linear combination because this plays no role in the 
uncertainty. 

^^The function outer () produces by default a matrix which is by default is the outer product of two 
vectors, i.e. vi al'- But it has a third parameter FUN which which it is possible to evaluate different function 
on the ‘grid’ defined by the Cartesian product of the two vector. Try for example 

> outer(1:3, 1:3, ’+’) 

> outer(l:3, 1:3, function(x,y) x + y*2)) 

> round( outer(0:10, 0:10, function(x,y) sin(x)*cos(y)), 2 ) 
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> mu.X <- c(2, -3) 

> s.X <- c(0.2, 0.5) 

> rho.X <- -0.8 

> V.X <- outer(s.X, s.X) 

> V.X[1,2] <- V.X[2,1] <- V.X[l,2]*rho.X 

> V.X 

[,1] [,2] 

[1,] 0.04 -0.08 

[2,] -0.08 0.25 

> ( cor.X <- V.X / outerCs.X,s.X) ) 

[,1] [,2] 

[1,] 1.0 -0.8 
[2,] -0.8 1.0 

> ( C <- rbindC c(l,2), c(-l,l) ) ) 

[,1] [,2] 

[1,] 1 2 
[2,] -1 1 

> ( mu.Y <- as.vector( C 7,*7, mu.X ) ) 

[1] -4 -5 

> ( V.Y <- C 7.*7. V.X l*l t(C) ) 

[,1] [,2] 

[1,] 0.72 0.54 
[2,] 0.54 0.45 

> ( s.Y <- sqrt(diagCV.Y)) ) 

[1] 0.8485281 0.6708204 

> ( cor.Y <- V.Y / outer(s.Y,s.Y) ) 

[,1] [.2] 

[1,] 1.0000000 0.9486833 
[2,] 0.9486833 1.0000000 


Let us get a visual representation of the probability distribution of X and Y using this time, 
instead of iso-pdf ellipses, points in the X — Y plane produced by the random generator 
provided by the package mnormt (see result in Fig. [ 6 ]): 

> n=5000; r.X <- rmnorm(n, mu.X, V.X); r.Y <- rmnorm(n, mu.Y, V.Y) 

> plotCr.X, col=’magenta’, xlim=c(-7,2), ylim=c(-8,-1) , cex=0.2, 

+ asp=l, xlab=’Xl , Yl’, ylab=’X2 , Y2’) 

> points(r.Y, col=’cyan’, cex=0.2) 


2.7 Conditional distributions in many dimensions 

Instead, a less known rule is that which gives the covariance matrix of a conditional dis¬ 
tribution with a number of variables above two. For example we might have 5 variables 
Xi, X 2 , ■ ■ ■ X^ and could be interested in the expected values and the covariance matrix 
of (Xi, X 4 X 5 ), given {X 2 , X^). Problems of this kind might look a mere mathematical 
curiosity, but they are indeed important to understand how we learn from data and we 
make probabilistic predictions using probability theory. 

Compact formulae to solve this problems can be found in Ref. [3]. If we partition and 
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Figure 6: Monte Carlo sampling of two multivariate normal distributions (see text). 


V into the the subsets of variable on which we want to condition and the other ones, i.e. 

M = f!!') (37) 

V = (I (38) 


the result is 


E 


X 


1 1 X 2 = 0 , 


X 


1 IX 2 


Vii Vi2 
y 21 y 22 


= fj^i + yu ^22 (« - M2) 


= Fii - Fi 2 Wo'F 21 


(39) 


(40) 


(And analogous formulae for E ^2\Xi=b 


and Var 


^2lxi=6 


In the case of a bivariate distributions we recover easily Eqs. (ElD-dli]), as it follows. 

Expected value: y 12 is the off-diagonal term pi 20 ia 2 , while y 22 is equal to (tI- Eq. (j39p 
becomes then 

1 


E[Xi = ij,i+ pi 2 ai 02 —{a- P 2 ) 


On 


oi 


= Ml + M 12 — (a- M 2 ) 
02 


(41) 


Variance: The remaining two terms of interest are also very simple: V n is of, while V 21 , 
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equal to V 12 , is Pi 2 (^icr 2 - It follows 


Var[Xi|^J = - P12 fJl £72 —/912 <71 fT2 

'^2 

2 2 2 
— ai— Pi2 

= ^?(l-P?2)- (42) 

Note that, while the conditioned expected value depends on the conditionand vector a, 
the conditioned variance does not . 

3 R implementation of the rule to condition multivariate 
normal distributions 

At this point, having set up all our tools, here is the R function which implements the 
above formulaef^l 

norm.mult.cond <- function(mu, V, x.c, full=TRUE) { 
out <- NULL 
n <- length(mu) 

# Checks dimensions of mu and V 
if ( sum(dim(V) != n) ) { 

cat( sprintf("dimensions of V incompatible with length of mu\n") ) 
return(out) 

} 

# number of conditionand variables 
nc <- length(x.c[!is.na(x.c)]) 

# peculiar/anomalous cases 

if( (length(x.c) > n) I (nc > n) ) { 

cat( sprintf("x.c has more elements than mu\n") ) 
return(out) 

} else if (nc == 0) { # No condition 

out$mu <- mu 

out$V <- V 

return(out) 

} else if(nc == n) { 

out$mu <- x.c # exact values 

out$V <- NULL # covariance matrix is meaningless 

return(out) 

} 

# Apply Eaton’s formulae 

v.c <- which(!is.na(x.c)) # conditioning variables 
V <- whichds . na(x. c) ) # variables of interest 

Vll <- V[v, v] 

V22 <- Vfv.c, v.c] 

V12 <- V[v, v.c] 

V21 <- Vfv.c, v] 

it will be mentioned in the footnote 1231 of Sec. [M a more numerically stable way to invert a matrix 
in R would be using the Choleski decomposition, but for the purpose of this note the difference is slightly 
appreciable. 
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mu.cond <- mu[v] + V12 7o*°/o solve (V22) (x. c [! is .na(x. c)] - mu[v.c]) 

V.cond <- VI1 - V12 7.*7. solve(V22) 7.*7. V21 
if(!full) { # returns only interesting part 
out$mu <- as.vector(mu.cond) 
out$V <- V.cond 

} else { # returns all (better to understand!I) 

mul <- mu 
VI <- V 

mul[v] <- mu.cond 

mul[v.c] <- X.c[I is.na(x.c)] 

VI [v, v] <- V.cond 
Vl[v.c, v.c] <- 0 
Vl[v, v.c] <- 0 

Vl[v.c, v] <- 0 

out$mu <- as.vector(mul) 
out$V <- VI 

> 

return(out) 

} 

The conditionand vector x. c has to contain numbers in the positions corresponding to the 
variables on which we want to condition, and NA, that is ‘not available’ or ‘unknown’, in the 
others, as we shall see in the examples. The code of parameter full is to return the vector 
of expectation and the covariance having the initial dimensionality. The expectation of 
the variable used as condition is the condition itself. All elements of the covariance matrix 
related to conditionals are instead zero, and the utility of this convention will be clear going 
through the examples. 

Let us try with a simple case of two normal quantities Hx = (2, —3) of section [T6l The 
question is how our uncertainty on fj,Xi change if we assume nx 2 = — 2: 

> ( V.X.cond <- norm.mult.cond(mu.X, V.X, c(NA, -2)) ) 

$mu 

[1] 1.68 -2.00 

$V 

[,1] [,2] 

[1,1 0.0144 0 
[2,1 0.0000 0 

> sqrt(diag(V.X.cond$V)) 

[1] 0.12 0.00 

The effect of the conditions to shift the expected value of ^Xi from 2 to 1.68 and to squeeze 
its standard uncertainty to 0.12. If we provide our result in the conventional form “expected 
value ± standard uncertainty”, the assumption (or ‘knowledge’) X 2 = —2 updates our 
‘knowledge’ about Xi from ‘2.00 ± 0.20’ to ‘1.67 ± 0.12’. 
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4 The ‘simplest experiment’ 

Let us go back to the first diagram of Fig. [H that we repeat here for convenience: 



This diagram describes the situation in which we have the physical quantity Xi, that is a 
parameter of our physical model of reality, and the reading on an instrument, X 2 , caused 
by Xi. 

The instrument has been well calibrate, such to give X 2 around Xi, but it is not 
perfect, as usual. In other words, even if we knew exactly the value xi we were not sure 
about the value X 2 we would read. For simplicity, let us model this uncertainty by a normal 
distribution, i.e. 


~ AA(Xi,ct2|i). (43) 

But we usually do not know Xi, and therefore we are even more uncertain about what we 
shall read on the instrument. In fact we are dealing with a joint distribution describing the 
joint uncertainty about the two quantities, that is 


f{xi,X 2 \I) = f{x 2 \xi, I) ■ f{xi\I). (44) 

Our knowledge about X 2 will be given, instead, by /(x 2 |/) = X 2 \I)dxi, a 

distribution characterized by Var[X 2 ] / Xa.T:\X 2 \x^- 

It is convenient to model our uncertainty about Xi with a normal distribution, with a 
standard deviation ui much larger than ct 2 |i ^ if we make a measurement we want t o g ain 
knowledge about that quantity! - and centered around the values we roughly expect o 
In order to simplify the calculations, in the exercise that follows let us assume that Xi 
is centered around zero. We shall see later how to get rid of this limitation. 

The joint distribution /(xi, X 2 \ I) is then given by 


/(Xi, X2 1 1 ) 


vDr iT2|] 


exp 


(X2 - Xl)" 


y/2^ 


<yi 


exp 


xi 

2 erf 


(45) 


As an exercise, let us see how to evaluate /(xi, X 2 | /)• The trick, already applied before, 
is to manipulate the terms in the exponent in order to recover a well known pattern. Here 


^^For extensive discussions about modelling prior knowledge of physical quantities see Ref. [5] and refer¬ 
ences therein. As a practical example, think at the width of the table at which a sit in the very moment 
you read these lines (or any other object), and about the reading on a ruler when you try to measure it. 
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are the details, starting from (I45|) rewritten dropping all irrelevant factors: 

ff in {x2-xif 

f{xi,X 2 \I) oc exp 


oc exp 


oc exp 


oc exp 


oc exp 


oc exp 



(46) 

(47) 

(48) 

(49) 

(50) 

(51) 


In this expression we recognize a bivariate distribution centered around (0,0), provided we 
interpret 


cj2|i + af 

^211 
1 + cjf 


(To 


1 Pl 2 5 


(52) 

(53) 


and after having checked the consistency of the terms multiplying xi X 2 - Indeed we have 


Pl 2 


a, 


= 1 - 


2|1 


TT 


(Ton + (Xl 


Pl2 = 


' 2|1 


(T^ + af 


2|1 


a. 

0-2 


2|1 


+ ^1 


and then the second term within parenthesis can be rewritten as 

2xiX2 2xiX 2 2pi2XiX2 


0 -2,, + cjf 


Then 


2|1 


f{xi,X 2 \I) OC exp 


(T 2 • (T 2 


1 


(Tl • (T2 


2 ( 1 -P? 2 ) Wi 

is definitively a bivariate normal distribution with 


x{ 2pi2XiX2 ^X2 
(Tl ■ 02 (To 


= 


V = 


0 

0 

<Xl 


of + (T,^ 


2|1 


(54) 

(55) 

(56) 

(57) 

(58) 

(59) 
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As a cross check, let us evaluate expected value and variance of X 2 if we assume a certain 
value of Xi, for example Xi = xi: 

2 

E[^2|xi=xi] = 0 +^ • (xi - 0) = Xi (60) 

2 

Var[X2|;^^=^J = u? + ^ aj = , (61) 

as it should be: provided we know the value of Xi our expectation of X 2 is around its value, 
with standard uncertainty (T 2 |i. 

More interesting is the other way around, that is indeed the purpose of the experiment: 
how our knowledge about Xi is modified by X 2 = X 2 - 


nxi\x,=j 


= 0 + 


(Tt 


Cjf + cr, 


(X2 -0) = X2 


1 


2|1 






erf + a. 


2|1 


■ 


1 + 
1 


l + a^ar 


(62) 

(63) 


Contrary to the first case, this second result is initially not very intuitive: the expected 
value of Xi is not exactly equal to the ‘observed’ value X 2 , unless cJi, that models our 
prior standard uncertainty about Xi, is much larger than the experimental resolution cr 2 |i. 
Similarly, the final standard uncertainty is in general a smaller than ct 2 |i, unless, again, 
^ 10 Although initially surprising, these result are in qualitative agreement with 
the good sense of experienced physicists [5]. 

^®You might recognize in Eq. (|63l) 


1 



which stems naturally from probability theory and tells how a new observation squeezes the uncertainty on 
the true value of a quantity. Indeed, it easy to show that Eq. can be written, as rather well known (see 
e.g. Ref. [^), as 




E[Xi] ■ X2 cTjm 


+ O'o 


weighted average of the prior expected value and observation, with weights equal to the prior variance and 
the instrument variance. 

En passent we can rewrite Eqs. (l62j-((63} as 


or 




Var[Xi|^^_J 


E[Xi] + 


erf 


erf + CT. 


2|l 


■ {X2 - E[Xi]) 


2 


0-1 


erf 


erf + CT, 


2|l 


2 

eri , 


E[^ilx,=.J = HX^] + kix2-ElX^]) 
Var[Xi = erf - A: (jf = (jf (1 - fc), 


with 


k = 


2 

eri 


erf + cr; 


2|l 


in order to emphasize the Kalman filter’s updating rules. [3] 
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5 Several independent measurements on the same physics 
quantity 

The next step is to see what happens when we are in the conditions to make several 
independent measurements on the same quantity Xi, possibly with different instruments, 
each one characterized by a conditional standard uncertainty Ujp and perfectly calibrated, 
that is E[-^i|jVi=a:i] ~ ^ 1 - "Ehe situation can be illustrated with the diagram at the center 
of Fig. dl reported here for convenience, extended to other observations: 



We have learned that if we are able to build up the covariance matrix of the joint distribution 
f{xi,X 2 ,X 3 ,... I /) the problem is readily solved, at least in the normal approximations we 
are using throughout the paper. 

In principle we should repeat the previous exercise to evaluate, sticking to the hrst two 
observations X 2 and X 3 , 

/(xi,X2,X3 I/) = /(xi I/) • /(X2 I Xi, /) • /(X3 I Xi,X2 /) (64) 

= /(a^i I ^) •/(a ;2 I xi,/) •/(x3 I xi,/), (65) 

where in the last step we have made explicit that /(xs | xi I) does not depend on X 2 , once 
Xi is known. But this does not implies that X 2 and X 3 are independent, as we shall see 
later! They are simply conditionally independent, i.e. independent under the condition (to 
be meant in general as an hypothesis) that Xi has a precisely known value. 

In reality we do not need to go through a similar derivation, that indeed was just 
an exercise. The easy solution arises, going back to the previous case, noting that the 
observation Oi is the sum of the value of the physics quantity v and the instrumental error 
Ci (a ‘noise’, as you might like to see it), i.e. 

Oi = V + ei, (66) 

with Ci modelled, as usual, by a normal distribution, that is, in general 

Ci ~ W(0,(TeJ. (67) 

The general uncertain vector X will be then X = (u, 01 , 02 ), as clarified by the following 
diagram: 
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These are then the hrst terms the transformation rules; 


( 68 ) 

(69) 

(70) 


Xi = V 

X 2 = 0 i = u + ei 

-^3 = 02 = U + 62 

from which the calculation of the covariance matrix is straightforward: 

• the i-th element diagonal is given by the variance of Xj, that is al, (aj + and 
so on; 

• the off-diagonal elements are all equal to af, because the only element in common in 
all linear combinations is v. 

Hence here is the covariance matrix of interest; 


V 





(71) 


5.1 Getting some insights with numerical examples 


At this point, instead of trying to get analytic formulae for all conditional probabilities of 
interest, we prefer to use the properties of the multivariate normal distribution implemented 
in the function norm.mult. cond() seen before. And, since the game is now automatic, 
we enlarge our space to 6 variables, Xi for the ‘true value’ and X 2 -Xq for four possible 
‘readings’. Although it is not any longer needed, we still set out prior central value about 
Xi around 0, which is equivalent to set to 0 all expected values. (For didactic purposes 
we have set ai only 10 larger than the experimental resolutions cjj|i, as we shall discuss 
commenting the results.) Here is the R code, with some comments: 


> n=6; muXl=0; sigmaXl=10 # set size and initial uncertainty on XI 

> mu <- repCmuXl, n) # set expected values (all equal!) 

> ( sigma <- cCsigmaXl, rep(l,n-l)) ) # standard deviations 

[ 1 ] 10 1 1 1 1 1 

> V <- matrix(rep(sigma[l]~2, n*n), c(n,n)) 

> diag(V)[2:n] <- diag(V)[2;n] + sigma[2;n]"2 

> V # covariance matrix 



[,1] 

[,2] 

[,3] 

[,4] 

[,5] 

[,6] 

[1,] 

100 

100 

100 

100 

100 

100 

[2,1 

100 

101 

100 

100 

100 

100 

[3,] 

100 

100 

101 

100 

100 

100 

[4,1 

100 

100 

100 

101 

100 

100 

[5,] 

100 

100 

100 

100 

101 

100 

[6,] 

100 

100 

100 

100 

100 

101 


> (su <- sqrt(diag(V))) # standard deviations 

[1] 10.00000 10.04988 10.04988 10.04988 10.04988 10.04988 

> V/outer(su,su) # correlation matrix 

[,1] [,2] [,3] [,4] [,5] [,6] 

[1,] 1.0000000 0.9950372 0.9950372 0.9950372 0.9950372 0.9950372 

[2,] 0.9950372 1.0000000 0.9900990 0.9900990 0.9900990 0.9900990 

[3,] 0.9950372 0.9900990 1.0000000 0.9900990 0.9900990 0.9900990 

[4,] 0.9950372 0.9900990 0.9900990 1.0000000 0.9900990 0.9900990 

[5,] 0.9950372 0.9900990 0.9900990 0.9900990 1.0000000 0.9900990 

[6,] 0.9950372 0.9900990 0.9900990 0.9900990 0.9900990 1.0000000 
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As we can see, all variables are correlated! The reason is very simple; any precise informa¬ 
tion we get about one of them changes the pdf of all others. In physics terms, a reading on 
a instrument changes our opinion about the value of the quantity of interest as well as of 
all other readings we have not yet done (or we not yet aware of their values -- in probability 
theory what matters is not time ordering, but ignorance). 

Let us now see what happens if we condition on a precise value of the true value 
Xi, for example Xi = 2: 

> ( mu.c <- c(2, rep(NA, n-1)) ) # conditionand 

[1] 2 NA NA NA NA NA 

> ( out<- norm.mult.cond(mu, V, mu.c) ) # resulting multivariate 

$mu 

[1] 2 2 2 2 2 2 
$V 

[,1] [,2] [,3] [,4] [,5] [,6] 

[1,] 0 0 0 0 0 0 

[2,] 0 1 0 0 0 0 

[3,] 0 0 1 0 0 0 

[4,] 0 0 0 1 0 0 

[5,] 0 0 0 0 1 0 

[6,] 0 0 0 0 0 1 

As we see, the expected values are all equal, is not longer uncertain, and all other 
variables become independent, more precisely “conditional independent” 

Let’s now see what happens if we condition instead on the observation X 2 = 2; 

> ( mu.c <- c(NA, 2, rep(NA, n-2)) ) 

[1] NA 2 NA NA NA NA 

> ( out<- norm.mult.cond(mu, V, mu.c) ) 

$mu 

[1] 1.980198 2.000000 1.980198 1.980198 1.980198 1.980198 
$V 

[,1] [,2] [,3] [,4] [,5] [,6] 

[1,] 0.990099 0 0.990099 0.990099 0.990099 0.990099 

[ 2 ,] 0.000000 0 0.000000 0.000000 0.000000 0.000000 

[3,] 0.990099 0 1.990099 0.990099 0.990099 0.990099 

[4,] 0.990099 0 0.990099 1.990099 0.990099 0.990099 

[ 5 ,] 0.990099 0 0.990099 0.990099 1.990099 0.990099 

[6,] 0.990099 0 0.990099 0.990099 0.990099 1.990099 

> ( out.s <- sqrt(diag(out$V)) ) # standard deviations 

[1] 0.9950372 0.0000000 1.4107087 1.4107087 1.4107087 1.4107087 

> out$V / outer(out.s, out.s) # correlation matrix (besides NaN) 

[,1] [,2] [,3] [,4] [,5] [,6] 

[1,] 1.0000000 NaN 0.7053456 0.7053456 0.7053456 0.7053456 

[2,] NaN NaN NaN NaN NaN NaN 

[3,] 0.7053456 NaN 1.0000000 0.4975124 0.4975124 0.4975124 

[4,] 0.7053456 NaN 0.4975124 1.0000000 0.4975124 0.4975124 

[5,] 0.7053456 NaN 0.4975124 0.4975124 1.0000000 0.4975124 

[6,] 0.7053456 NaN 0.4975124 0.4975124 0.4975124 1.0000000 

The ‘measurement’ has had the effect of changing all our expectations, becoming all ‘prac¬ 
tically equal’ to the observed value of 2. But the uncertainties about the possible ’future 


24 




X 

Figure 7 : Normal distributions describing our uncertainty abont Xi and X3 before (dashed 
line) and after (solid line) the observation X2 = 2 (see text). 


observations’ are different than that of the true value Xi. They are in fact larger by a 
factor y /2 (see also Fig. ED- The reason is that X2 and X3 (i.e. oi and 02) and all other 
possible readings 03, 04 and 05 ‘communicate’ each other via Xi: their uncertainty is than 
the combination (quadratic combination!) of that assigned to Xi and that of the readings 
Xi if we new exactly Xi (that is fTeJ. 

Let us see if we add another observation, e.g. X 3 = 1, that is we recondition now 
simultaneously on X2 = 2 and X2 = 1 

> mu.c <- c(NA, 2, 1, NA, NA, NA) 

> ( out<- norm.mult.cond(mu, V, mu.c) ) 

$mu 

[1] 1.492537 2.000000 1.000000 1.492537 1.492537 1.492537 


$V 

[,1] [,2] [,3] 


[1.] 

0.4975124 

0 

0 

0 

[2,] 

0.0000000 

0 

0 

0 

[3,] 

0.0000000 

0 

0 

0 

[4,] 

0.4975124 

0 

0 

1 

[5,] 

0.4975124 

0 

0 

0 

[6,] 

0.4975124 

0 

0 

0 


[,4] [,5] [,6] 
4975124 0.4975124 0.4975124 
0000000 0.0000000 0.0000000 
0000000 0.0000000 0.0000000 
4975124 0.4975124 0.4975124 
4975124 1.4975124 0.4975124 
4975124 0.4975124 1.4975124 


> ( out.s <- sqrt(diag(out$V)) ) 

[1] 0.7053456 0.0000000 0.0000000 1.2237289 1.2237289 1.2237289 


> out$V / outer(out.s, out.s) 
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[,1] [,2] [,3] [,4] [,5] [,6] 


[T 

,] 

1, 

.0000000 

NaN 

NaN 

0 

[2, 

,] 


NaN 

NaN 

NaN 


[3, 

,] 


NaN 

NaN 

NaN 


[4, 

,] 

0, 

.5763904 

NaN 

NaN 

1 

[5, 

,] 

0, 

.5763904 

NaN 

NaN 

0 

[6, 

,] 

0, 

.5763904 

NaN 

NaN 

0 


5763904 0.5763904 0.5763904 
NaN NaN NaN 

NaN NaN NaN 

0000000 0.3322259 0.3322259 
3322259 1.0000000 0.3322259 
3322259 0.3322259 1.0000000 


As we can see, after the second observation the expected values are practically equal 
to 1.5, average between the two readings. The uncertainty about the true value has de¬ 
creased by a factor 1.41, that is \/2, while the uncertainties about the forecasting decrease 
only by a factor 1.15, going from 1.41 to 1.22. This latter number can be understood as 
Vo.705^ -|- 1^ = 1.22, as it will be justified in a while. 

Let us see what happens if we suppose that also Xi is precisely known, namely Xi = 3 
(different from Xi = 2 previously used, not only “just to change” but also to use a value 
different from that of X 2 and X3): 


> mu.c <- c(3, 2, 1, NA, NA, NA) 

> ( out<- norm.mult.cond(mu, V, mu.c) ) 
$mu 

[1] 3 2 1 3 3 3 


[, 6 ] 

O.OOOOOOe+00 

O.OOOOOOe+00 

O.OOOOOOe+00 

2.302158e-12 

2.302158e-12 

l.OOOOOOe+00 


$V 

1 

[,1] 1 

:.2] 1 

[,3] 

[,4] 

[,5] 

[1.] 

0 

0 

0 

O.OOOOOOe+00 

O.OOOOOOe+00 

[2,] 

0 

0 

0 

O.OOOOOOe+00 

O.OOOOOOe+00 

[3,] 

0 

0 

0 

O.OOOOOOe+00 

O.OOOOOOe+00 

[4,] 

0 

0 

0 

l.OOOOOOe+00 

-2.302158e-12 

[5,] 

0 

0 

0 

-2.302158e-12 

l.OOOOOOe+00 

[6,] 

0 

0 

0 

-2.302158e-12 

-2.302158e-12 


If Xi is perfectly known the observations X 2 and A3 are irrelevant, as it has to be0 

Finally, going back to the physical case of interest, in which Xi is unknown, let us add 

a third observation, e.g. A4 = 0 

> mu.c <- c(NA, 2, 1, 0, NA, NA) 

> ( out<- norm.mult.cond(mu, V, mu.c) ) 

$mu 

[1] 0.9966777 2.0000000 1.0000000 0.0000000 0.9966777 0.9966777 


$V 

[ 1 ,] 

[ 2 ,] 

[3,] 

[4,] 

[5,] 

[ 6 ,] 


[,1] [,2] [,3] [,4] 


0.3322259 

0.0000000 

0.0000000 

0.0000000 

0.3322259 

0.3322259 


[,5] 

0.3322259 

0.0000000 

0.0000000 

0.0000000 

1.3322259 

0.3322259 


[, 6 ] 

0.3322259 

0.0000000 

0.0000000 

0.0000000 

0.3322259 

1.3322259 


> ( out.s <- sqrt(diag(out$V)) ) 

[1] 0.5763904 0.0000000 0.0000000 0.0000000 1.1542209 1.1542209 

^®And if X2 and X3 are ‘very far’ from Xi? In this simple model we are using, there is little to do, because 
any observation from minus infinite to plus infinite is never incompatible with a any Gaussian. But we know 
by experience that something strange might be happened. It this case we need to put in mathematical form 
the model we have in mind. 
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> out$V / outer(out.s, out.s) 



[,1] 

[,2] 

[,3] 

[,4] 

[,5] 

[,6] 

[1,] 

1.0000000 

NaN 

NaN 

NaN 

0.4993762 

0.4993762 

[2,] 

NaN 

NaN 

NaN 

NaN 

NaN 

NaN 

[3,] 

NaN 

NaN 

NaN 

NaN 

NaN 

NaN 

[4,] 

NaN 

NaN 

NaN 

NaN 

NaN 

NaN 

[5,] 

0.4993762 

NaN 

NaN 

NaN 

1.0000000 

0.2493766 

[6,] 

0.4993762 

NaN 

NaN 

NaN 

0.2493766 

1.0000000 


As we can see, the value of Xi is with very good approximation the average of the three 
observations, that is 1, with a the standard uncertainty decreasing with Ijy/n, passing from 
1.00 to 0.71 to 0.58. This is because the three pieces of information enter with the same 
weight, since cjj|i, related to the ‘precision of the instrument’, is the same in all cases and 
equal to 1. 

As far as the prediction of future observations, obviously they must be centered around 
the value we believe Xi is, at the best of our knowledge, a value which changes with the 
observations. As far as uncertainty and correlation coefficient are concerned, they decrease 
as follows (starting from the very beginning, before any observation); 

Standard uncertainty: 10.05, 1.41, 1.22, 1.15. 

We can see that they are a quadratic combination of the uncertainty with which we 
know Xi and that with which we expect the observation given a precise value of Xi. 
If we indicate the state of information at time t as /(t), the rule is 

Var[W|/(t)] = Var[Ai I/(t)] + aj,. (72) 

Asymptotically, when after many measurements the determination of Xi is very ac¬ 
curate, it only remains Ug., as it has to be. 

Correlation coefficient: 0.990, 0.50, 0.33, 0.25. 

It is initially very high because any new observation changes dramatically our expec¬ 
tation about the others. But then, when we have already made several observations, 
a new one has only very little effect on our forecasting. Asymptotically, when we have 
made a very large number of observations and Xi is very well ‘determined’, all future 
observations become essentially “conditionally independent”. 

5.2 Follows up 

At this point the game can be continued with different options. One has only to re-build 
the initial covariance matrix and play changing the conditions. 

An interesting exercise is certainly that of increasing ui, for example to 100, i.e. 100 
times large than the ‘precision’ of our instrument, or even 1000, to see how our conclusions 
change. The result will be that true value and future measurements are ‘practically’ only 
determined by the observations. 

It could also interesting to see what happens if the different observations come from 
instruments having different precisions. 

Finally, one could produce a (relatively) large random sample of observations measuring 
the same true value. Being m the number of observations, the dimensionality of our problem 
will be m = n -b 3, because we have to add - obviously - Xi and we want to have at least 
two future observations (Ai+m-i-i and Ai+m+i) in order to check their degree of correlation. 
Here is the R session in which we have been playing with a sample of 100 observations (for 
obvious reasons we shall focus only on the uncertain variables, i.e. Xi, Xioi and A 102 ): 
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> m <- 100; n <- m + 3 # dimensionality of the problem 

> mu <- rep(0,n) 

> sigma <- c(10, rep(l,n-l)) 

> V <- matrix(rep(sigma[l]~2, n*n), c(n,n)) 

> diag(V)[2:n] <- diag(V)[2:n] + sigma[2:n]"2 

> ( XI.God <- 2 ) # the exact value of the quantity we are going to measure 

[ 1 ] 2 

> sample <- rnorm(m, XI.God, sigma[2]) # random sample 

> mean(sample) # saunple mean 

[1] 2.094649 

> mu.c <- c(NA, sample, NA, NA) 

> out <- norm.mult.cond(mu, V, mu.c) # no printouts, for obvious reasons 

> ( out <- marginal.norm(out$mu, out$V, c(l, rep(0, m), 1, 1)) ) # interesting part 
$mu 

[1] 2.094439 2.094439 2.094439 
$V 

[,1] [,2] [,3] 

[1,] 0.009999 0.009999 0.009999 
[2,] 0.009999 1.009999 0.009999 
[3,] 0.009999 0.009999 1.009999 

> ( su <- sqrt(diag(out$V)) ) 

[1] 0.099995 1.004987 1.004987 

> out$V /outer(su,su) 

[,1] [,2] [,3] 

[1,] 1.00000000 0.09949879 0.09949879 
[2,] 0.09949879 1.00000000 0.00990001 
[3,] 0.09949879 0.00990001 1.00000000 


Expected values of the true value and of the future measurements are now equal to the 
average of the sample, with excellent approximation. This is due to the fact that the initial 
uncertainty of 10 is in this case much larger than the final one of 0.10. This value is indeed 
equal to = 1/10, the famous standard deviation of the mean. This means that the 

standard deviation of the sample, that is 
> sd(sainple) 

[1] 0.08263812 

is not used. This is not a surprise, since in our model are assumed to be perfectly 
known 0 

We see that the uncertainty on the future observations is a bit larger than that on the 
true value, as it must be. This is because they depends on the uncertain value of the true 
value and the experimental resolution, combining in quadrature (VO.l^ + 1^ = 1.00499). 
The correlations become small, in particular those among the future observations, which 
practically become ‘conditionally independent’. Indeed, the covariance matrix is that shown 
in Eq. (El), with (Ti replaced by |s Rm p]e (what matters is the uncertainty about Xi, not 
its source!). 


'A model that would allow to infer the cri|i’s is not any longer linear, thus going beyond the purpose of 
this note. 
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Figure 8 : Diagrams to model a systematic effect. 


6 Adding a simple systematic effect (‘offset’ type) 


Let us now move to the second diagram of Fig. [H which we repeat 


her for convenience, in 



which X 3 is caused by both Xi and X 2 ■ This diagram can model the presence of a systematic 
effect, because we expect that the possible values of X 3 are caused by both Xi and X 2 , and 
it will be then influenced by how uncertain is the quantity X 2 that acts as a systematic. The 
simplest case of systematic effect is an additive one, of unknown value, but with expected 
value 0 (the instrument has been calibrated ‘at the best’!) and a standard uncertainty 
(T 2 . Needless to say, we also model this uncertainty with a normal distribution, with much 
simplihcation in the calculations (and also because this is often the case). 

The model can be extended to several observations, as shown in the left diagram of 
Fig. EJ In the figure it is also shown a different interpretation of the effect of the systematic 
error, which is very close to the physicist intuition. The observations X 3 , X 4 and X 5 
are normally distributed around a kind of ‘virtual state’ Xy determined by the unknown 
true value Xi and the unknown offset X 2 , i.e. the true ‘zero’ of the instrument. The 
transformation rule to build the initial covariance matrix will be then, starting from symbols 
that have a physical meaning [value v, ‘zero’ z, and the others as in Eqs. (I70l) - (f70] 



= V 

(73) 

^2 

= Z 

(74) 


{Xv = v + z = Xi+ X2) 

(75) 

^3 

= Oi = Xy + 63 = Xi + X2 + 63 

(76) 

X4 

= 02= Xy + 64 = Xi + X2 + 64 

(77) 

^5 

= 03= Xy + 65 = Xi + X2 + 65 

(78) 
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The calculation of the variances is trivial. As far as the covariances we have 


Cov[Xi,A2 ] 

= 0 


(79) 

Cov[Ai,Xi] 

= 

(i>2) 

(80) 

Cov[A2,Xi] 

= 

(i>2) 

(81) 

Cow[Xi,Xj] 

= crf + al 

(i >2, j > 2) 

(82) 


This is then the covariance matrix of interest, limited to the five variables shown in the 
figure (and than it is easy to continue): 



0 




\ 

0 

^2 

^2 

^2 

^2 



^2 

(jf + 

erf + erf 

erf +erf 


o-f 

0-2 

af + ct| 

o-f + erf + al 

erf + 0-1 



^2 

erf + ct| 

erf + erf 

fjf + fjf + 

/ 


where Ueg stands for ( 7 ( 63 ), i.e. and so on, later also indicated with the short 

hand 

From such an interesting matrix we can expect interesting results, useful to train our 
intuition. But before analyzing some cases, as done in the previous section, let us make 
the exercise to build up the covariance matrix in a different way. The transformation rules 
()73l) - ()78p can be rewritten using the transformation matrix 


C 


/ 1 0 0 0 0 \ 
0 10 0 0 
1110 0 
110 10 
V1 1 0 0 1 / 


(84) 


to be applied to the diagonal matrix of the independent variables, 


Vo 


/ o-f 0 0 0 \ 

0 0 0 0 

0 0 0 
0 0 0 0-2^ 0 

V 0 0 0 Q alj 


(85) 


Applying then the transformation rule of the covariance matrix we reobtain the above result 
- an implementation in R will be shown in the next subsection. 


6.1 Numerical examples 


Let set up the covariance matrix for 5 possible ‘observations’ 

> n=7; muXl=0; sigmaXl=10; muZ=0; sigmaZ=l # set parameters 

> mu <- c(muXl, muZ, rep(muXl+muZ, n-2)) # set expected values 

> ( sigma <- cCsigmaXl, sigmaZ, rep(l,n-2)) ) # standard deviations 

[ 1 ] 10 1 1 1 1 1 1 


> V <- matrix(rep(0, n*n), c(n,n)) # cov matr # step 0 

> V[(l:n)[-2], (l:n)[-2]] <- sigma[l]~2 # step 1 

> V[(2:n), (2:n)] <- V[(2:n), (2:n)] + sigma[2]~2 # step 2 

> diag(V)[3:n] <- diag(V)[3:n] + sigma[3:n]"2 # step 3 
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> V 




[dl 

[,2] 

[,3] 

[,4] 

[,5] 

[,6] 

[,7] 

[1 

,] 

100 

0 

100 

100 

100 

100 

100 

[2 

,] 

0 

1 

1 

1 

1 

1 

1 

[3 

,] 

100 

1 

102 

101 

101 

101 

101 

[4 

,] 

100 

1 

101 

102 

101 

101 

101 

[5 

,] 

100 

1 

101 

101 

102 

101 

101 

[6 

,] 

100 

1 

101 

101 

101 

102 

101 

[7 

,] 

100 

1 

101 

101 

101 

101 

102 


> (su <- sqrt(diag(V))) 

[1] 10.0000 1.0000 10.0995 10.0995 10.0995 10.0995 10.0995 

> roundC V/outer(su,su), 4) 

[,1] [,2] [,3] [,4] [,5] [,6] [,7] 

[1,] 1.0000 0.000 0.9901 0.9901 0.9901 0.9901 0.9901 

[2,] 0.0000 1.000 0.0990 0.0990 0.0990 0.0990 0.0990 

[3,] 0.9901 0.099 1.0000 0.9902 0.9902 0.9902 0.9902 

[4,] 0.9901 0.099 0.9902 1.0000 0.9902 0.9902 0.9902 

[5,] 0.9901 0.099 0.9902 0.9902 1.0000 0.9902 0.9902 

[6,] 0.9901 0.099 0.9902 0.9902 0.9902 1.0000 0.9902 

[7,] 0.9901 0.099 0.9902 0.9902 0.9902 0.9902 1.0000 

Let us also show the alternative way to build up the covariance matrix 

> C <- matrix(rep(0, n*n), c(n,n)) # transf. matrix 

> C[,l] <- c(l, 0, repd, 11-2)) 

> C[,2] <- c(0, repd, n-1)) 

> diag(C) <- repd, n) 

> C 



[d] 

[,2] 

[,3] 

[,4] [ 

,5] 

[,6] 

[,7] 


[1,] 

1 

0 

0 

0 

0 

0 

0 


[2,1 

0 

1 

0 

0 

0 

0 

0 


[3,] 

1 

1 

1 

0 

0 

0 

0 


[4,1 

1 

1 

0 

1 

0 

0 

0 


[5,] 

1 

1 

0 

0 

1 

0 

0 


[6,] 

1 

1 

0 

0 

0 

1 

0 


[7,1 

1 

1 

0 

0 

0 

0 

1 


> VO 

<- matrix! 

rep(0 

, n*n) 

, c 

(n,n)) 

# 

initial diagonal matrix 

> diag(VO) 

<- 

sigma 






> ( V 

<- C 

7o*X 

VO t(C) 

) 


# 

joint covariance matrix 


Id] 

[,2] 

[,3] 

[,4] [ 

,5] 

[,6] 

[,7] 


[1,] 

100 

0 

100 

100 

100 

100 

100 


[2,1 

0 

1 

1 

1 

1 

1 

1 


[3,] 

100 

1 

102 

101 

101 

101 

101 


[4,1 

100 

1 

101 

102 

101 

101 

101 


[5,] 

100 

1 

101 

101 

102 

101 

101 


[6,] 

100 

1 

101 

101 

101 

102 

101 


[7,1 

100 

1 

101 

101 

101 

101 

102 



As we see the result is identical to that obtained setting the elements ‘by hand’. 
Then let us now repeat the steps previously followed without systematic offset. 


6.1.1 Condition on Xi = 2 (“known true value”) 

> ( mu.c <- c(2, rep(NA, n-1)) ) 

[1] 2 NA NA NA NA NA NA 
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> ( out <- norm.mult.cond(mu, V, mu.c) ) 

$mu 

[1] 2 0 2 2 2 2 2 
$V 

[,1] [,2] [,3] [,4] [,5] [,6] [,7] 

[1,] 0 0 0 0 0 0 0 
[2,] 0 111111 
[3,] 0 12 1111 

[4,] 0 112 111 

[5,] 0 1112 11 

[6,] 0 11112 1 
[7,] 0 111112 

> ( out.s <- sqrt(diag(out$V)) ) 

[1] 0.000000 1.000000 1.414214 1.414214 1.414214 1.414214 1.414214 

> roundC out$V / outer(out.s, out.s), 3) 

[,1] [,2] [,3] [,4] [,5] [,6] [,7] 

[1,] NaN NaN NaN NaN NaN NaN NaN 
[2,] NaN 1.000 0.707 0.707 0.707 0.707 0.707 
[3,] NaN 0.707 1.000 0.500 0.500 0.500 0.500 
[4,] NaN 0.707 0.500 1.000 0.500 0.500 0.500 
[5,] NaN 0.707 0.500 0.500 1.000 0.500 0.500 
[6,] NaN 0.707 0.500 0.500 0.500 1.000 0.500 
[7,] NaN 0.707 0.500 0.500 0.500 0.500 1.000 

The condition on the ‘true value’ changes the values of the observables to its value, but it 
does not affect the offset, which has a role in the uncertainty of the future observations as 
well in their correlation. In fact, contrary to the case see in the previous section without 
uncertain offset, they are not any longer independent. They would become independent if 
also the offset were known (try for example with “mu.c <- c(2, 0, rep(NA, n-2))” to 
see the difference, or even better with “mu. c <- c(2, 1, rep(NA, n-2))”). 

6.1.2 Condition on = 2 (“single observation”) 

> ( mu.c <- c(NA, NA, 2, rep(NA, n-3)) ) 

[1] NA NA 2 NA NA NA NA 

> out <- norm.mult.cond(mu, V, mu.c) 

> round( out$mu, 4) 

[1] 1.9608 0.0196 2.0000 1.9804 1.9804 1.9804 1.9804 

> round( out$V, 4) 

[,1] [,2] [,3] [,4] [,5] [,6] [,7] 

[1,] 1.9608 -0.9804 0 0.9804 0.9804 0.9804 0.9804 

[2,] -0.9804 0.9902 0 0.0098 0.0098 0.0098 0.0098 

[3,] 0.0000 0.0000 0 0.0000 0.0000 0.0000 0.0000 

[4,] 0.9804 0.0098 0 1.9902 0.9902 0.9902 0.9902 

[5,] 0.9804 0.0098 0 0.9902 1.9902 0.9902 0.9902 

[6,] 0.9804 0.0098 0 0.9902 0.9902 1.9902 0.9902 

[7,] 0.9804 0.0098 0 0.9902 0.9902 0.9902 1.9902 

> round( out.s <- sqrt(diag(out$V)), 4 ) 

[1] 1.4003 0.9951 0.0000 1.4107 1.4107 1.4107 1.4107 

> round( out$V / outer(out.s, out.s), 3) 

[,1] [,2] [,3] [,4] [,5] [,6] [,7] 

[1,] 1.000 -0.704 NaN 0.496 0.496 0.496 0.496 

[2,] -0.704 1.000 NaN 0.007 0.007 0.007 0.007 
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[3 

,] 


NaN 


NaN 

NaN 


NaN 


NaN 


NaN 


NaN 

[4 

,] 

0 

.496 

0, 

.007 

NaN 

1 

.000 

0 

.498 

0 

.498 

0 

.498 

[5 

,] 

0 

.496 

0, 

.007 

NaN 

0 

.498 

1 

.000 

0 

.498 

0 

.498 

[6 

,] 

0 

.496 

0, 

.007 

NaN 

0 

.498 

0 

.498 

1 

.000 

0 

.498 

[7 

,] 

0 

.496 

0, 

.007 

NaN 

0 

.498 

0 

.498 

0 

.498 

1 

.000 


To understand the result we need to compare it with the case without uncertainty uncer¬ 
tainty. In that case we had Xi = 1.98. Now we have Xi = 1.96. The difference, although 
practically irrelevant, is conceptually important. It is indeed equal to the expected value 
of the offset (precisely 0.0196). This is because the role of the observation is to give us an 
information about Xi + X 2 , sum of the true value and the offset. The fact that we use the 
observations to update our knowledge on the true value is simply because the offset is a 
priori better known that the true value, as it is well understood by experienced physicists: 
if the calibration is poor the instrument cannot be used for ‘measurements’. Note also the 
correlation that now appears between Xi and X 2 , and in particular its negative sign: the 
value of the true value could increase at the ‘expenses’ of the offset, and the other way 
around. 


6.1.3 Condition on X 3 = 2 and X 4 = 1 (“two observations”) 


> ( mu.c <- c(NA, NA, 2, 1, rep(NA, n-4)) ) 
[1] NA NA 2 1 NA NA NA 


> out <- norm.mult.cond(mu, V, mu.c) 

> round( out$mu, 4) 


[1] 1.4778 0.0148 2.0000 1.0000 1.4926 1.4926 1.4926 


> round( out$V 

[, 1 ] 

[1,] 1.4778 -0.9852 

[2,] -0.9852 0.9901 

[3,] 0.0000 0.0000 

[4,] 0.0000 0.0000 

[5,] 0.4926 0.0049 

[6,] 0.4926 0.0049 0 

[7,] 0.4926 0.0049 0 


[,5] [,6] [,7] 
0 0.4926 0.4926 0.4926 
0 0.0049 0.0049 0.0049 
0 0.0000 0.0000 0.0000 
0 0.0000 0.0000 0.0000 
0 1.4975 0.4975 0.4975 
0 0.4975 1.4975 0.4975 
0 0.4975 0.4975 1.4975 


4) 

[,2] [,3] [,4] 
0 
0 
0 
0 
0 


> roundC out.s <- sqrt(diag(out$V)), 4 ) 


[1] 1.2157 0.9951 0.0000 0.0000 1.2237 1.2237 1.2237 


> round( out$V / outer(out.s, out.s), 3) 




[,1] 


[,2] 

[,3] 

[,4] 


[,5] 


[,6] 


[,7] 

[1,] 

1 

.000 

-0 

.814 

NaN 

NaN 

0 

.331 

0 

.331 

0 

.331 

[2,1 

-0 

.814 

1 

.000 

NaN 

NaN 

0 

.004 

0 

.004 

0 

.004 

[3,] 


NaN 


NaN 

NaN 

NaN 


NaN 


NaN 


NaN 

[4,1 


NaN 


NaN 

NaN 

NaN 


NaN 


NaN 


NaN 

[5,] 

0 

.331 

0 

.004 

NaN 

NaN 

1 

.000 

0 

.332 

0 

.332 

[6,] 

0 

.331 

0 

.004 

NaN 

NaN 

0 

.332 

1 

.000 

0 

.332 

[7,1 

0 

.331 

0 

.004 

NaN 

NaN 

0 

.332 

0 

.332 

1 

.000 


The only new effect we observe is the increase (in module) of the correlation coefficient 
between true value and offset. This is due to the fact that the increased number of ob¬ 
servation has increased the constrain between the two quantities. It will increase more 
if we use further observations, for example conditioning on ’’mu.c <- c(NA, NA, 2, 1, 
1.5, 2.2, 0.5)”, or decreasing the standard deviations cjj|i 2 - For example if we set all 
^i\i ,2 to 0.1, the same conditioning on X 3 and X 3 would produce a correlation coefficient 


33 



of —0.9975. Asymptotically there will be a deterministic constrain between and X 2 of 
the kind Xi + X 2 = k, and the two variables become logically dependent. 

6.1.4 “Ricalibration of the offset” {Xi = 2; A 3 = 2, A 4 = 1) 

What happens if we instead fix the value of the true value and some values of the observ¬ 
ables? In this case we update our information on the offset. Let us see the case in which 
we fix the value of the true value at 2, and the average of the two observations at 1.5. 

> ( mu.c <- c(2, NA, 2, 1, rep(NA, n-4)) ) 

[1] 2 NA 2 1 NA NA NA 

> out <- norm.mult.cond(mu, V, mu.c) 

> round( out$mu, 4) 

[1] 2.0000 -0.3333 2.0000 1.0000 1.6667 1.6667 1.6667 

> round( out$V, 4) 

[,1] [,2] [,3] [,4] [,5] [,6] [,7] 

[1,] 0 0.0000 0 0 0.0000 0.0000 0.0000 

[2,] 0 0.3333 0 0 0.3333 0.3333 0.3333 

[3,] 0 0.0000 0 0 0.0000 0.0000 0.0000 

[4,] 0 0.0000 0 0 0.0000 0.0000 0.0000 

[5,] 0 0.3333 0 0 1.3333 0.3333 0.3333 

[6,] 0 0.3333 0 0 0.3333 1.3333 0.3333 

[7,] 0 0.3333 0 0 0.3333 0.3333 1.3333 

> roundC out.s <- sqrt(diag(out$V)), 4 ) 

[1] 0.0000 0.5774 0.0000 0.0000 1.1547 1.1547 1.1547 

> round( out$V / outer(out.s, out.s), 3) 



[,1] 

[,2] 

[,3] 

[,4] 

[,5] 

[,6] 

[,7] 

[1,] 

NaN 

NaN 

NaN 

NaN 

NaN 

NaN 

NaN 

[2,] 

NaN 

1.0 

NaN 

NaN 

0.50 

0.50 

0.50 

[3,] 

NaN 

NaN 

NaN 

NaN 

NaN 

NaN 

NaN 

[4,] 

NaN 

NaN 

NaN 

NaN 

NaN 

NaN 

NaN 

[5,] 

NaN 

0.5 

NaN 

NaN 

1.00 

0.25 

0.25 

[6,] 

NaN 

0.5 

NaN 

NaN 

0.25 

1.00 

0.25 

[7,] 

NaN 

0.5 

NaN 

NaN 

0.25 

0.25 

1.00 


As a result, the expected value of the offset becomes —0.33, with a standard deviation of 
0.58, against the (possible) intuitive guess of —0.5 (i.e 1.5 —2.0) with a standard uncertainty 
of 0.71 (i.e. l/y/2). The reason is that our prior knowledge on the offset had a standard 
uncertainty of 1, that has to be taken into account. Indeed it can be easily checked that the 
‘intuitive’ result would have been recovered if we had a very large uncertainty (cj2 —)• 00). 
In fact —0.33 is the weighted average of the initial value 0 and —0.5, with weights equal to 
1 and 2. The reason is that the result based on reconditioning provides automatically the 
rule of the weighted average with ‘inverse of the variances’, where the ‘variance’ associated 
to —0.5 would be that obtained if the prior knowledge on the offset was irrelevant (i.e. 
(T2 —)■ 00). 

7 Measuring two quantities with the same instrument af¬ 
fected by offset uncertainty 

Another interesting issue, very common in experimental physics, is when we make several 
measurements on homogeneous quantities using the same instrument that, as all instru¬ 
ments, has unavoidable uncertainty in the calibration. The situation is sketched in the 
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Figure 9: Model to describe the measurements of two quantities with the same instrument 
affected by some systematics. 


diagrams of Fig. [9l drawn with the different symbols used in the text: Xi and X 2 are the 
true values; X^ the common offset; X 4 and X 5 the independent readings when the instru¬ 
ment is applied to Xi; Xq and Xy the independent readings when the instrument is applied 
to X 2 . 

From this model we can easily build the transformation matrix C 


C 


/lOOOOOOX 
0 1 0 0 0 0 0 

0 0 1 0 0 0 0 

10 110 0 0 
10 10 10 0 

0 110 0 10 

V 0 1 1 0 0 0 1 / 


( 86 ) 


(for example it says that row 6 depends on X 2 , X 3 and 63). Applying it to the starting diag¬ 
onal matrix (ui, V 2 and 2 : are initially independent, the various errors 61-64 are independent) 
we get the covariance matrix of the joint multivariate normal of interest: 


(<y\ 

0 

0 


0 

A 

0 

0 

0 

0 

^3 

f^3 


0 


cl + c‘l + c 


0 

^3 

cf + ci 

0 

A 

^3 


\o 


^3 

^3 




cr? 




cjf + (T| 

+ 0 -| -h CT, 

CTo 


£2 


^2 

cl 


^3 

al 


^2 


+ 


£3 


ai + ai 


a' 


^2 

cl 


^3 

cl 


C 2 + cl 

+ 4 + ) 


This is a very interesting covariance matrix and we leave the reader the pleasure of exploiting 
all possibilities. Here we only show a numerical example, with parameters similar to the 
ones used before for a better understanding, and just discuss a single case of conditioning. 


> n=7; muXl=0; sigmaXl=10; muX2=0; sigmaX2=10; # set parameters 

> muZ=0; sigmaZ=l 

> mu <- c(muXl, muX2, muZ, rep(muXl+muZ,2), rep(muX2+muZ,2)) # set expected values 

> ( sigma <- cCsigmaXl, sigmaXl, sigmaZ, rep(l, n-3)) ) # standard deviations 
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# tranformation matrix 


[ 1 ] 10 10 1 1 1 1 1 

> C <- matrix(rep(0, n*n), c(n,n)) 

> diag(C) <- repd, n) 


> 

C[4 

,] 

<- 

c(l. 

0, 

1, 

1, 

0, 

. 0, 

0) 

> 

C[5 

,] 

<- 

c(l. 

0, 

1, 

0, 

1, 

. 0, 

0) 

> 

C[6 

,] 

<- 

c(0. 

1, 

1, 

0, 

0, 

. 1, 

0) 

> 

C[7 

,] 

<- 

c(0. 

1, 

1, 

0, 

0, 

. 0, 

1) 

> 

C 












[, 

1] 

[,2] 

[,3] 

1: 

,4] 

[, 

,5] 

[,6] 


[ 1 ,] 1 0 0 0 0 0 0 

[ 2 ,] 0 1 0 0 0 0 0 

[3,] 0 0 1 0 0 0 0 

[4,] 10 110 0 0 

[5,] 10 10 10 0 

[ 6 ,] 0 110 0 10 

[7,] 0 1 1 0 0 0 1 

> VO <- matrix(rep(0, n*n), c(n,n)) # covariance matrix 

> diag(VO) <- sigma~2 

> V <- C 7.*7. VO 7.*7. t(C) 


> V 

[,1] 

[,2] 

[,3] 

[,4] 

[,5] 

[,6] 

[,7] 

[1,] 

100 

0 

0 

100 

100 

0 

0 

[2,] 

0 

100 

0 

0 

0 

100 

100 

[3,] 

0 

0 

1 

1 

1 

1 

1 

[4,] 

100 

0 

1 

102 

101 

1 

1 

[5,] 

100 

0 

1 

101 

102 

1 

1 

[6,] 

0 

100 

1 

1 

1 

102 

101 

[7,] 

0 

100 

1 

1 

1 

101 

102 


> su <- sqrt(diag(V)) # standard uncertainties 

> roundC V/outer(su,su), 4) # correlation matrix 

[,1] [,2] [,3] [,4] [,5] [,6] [,7] 

[1,] 1.0000 0.0000 0.000 0.9901 0.9901 0.0000 0.0000 

[2,] 0.0000 1.0000 0.000 0.0000 0.0000 0.9901 0.9901 

[3,] 0.0000 0.0000 1.000 0.0990 0.0990 0.0990 0.0990 

[4,] 0.9901 0.0000 0.099 1.0000 0.9902 0.0098 0.0098 

[5,] 0.9901 0.0000 0.099 0.9902 1.0000 0.0098 0.0098 

[6,] 0.0000 0.9901 0.099 0.0098 0.0098 1.0000 0.9902 

[7,] 0.0000 0.9901 0.099 0.0098 0.0098 0.9902 1.0000 

Now let us assume we have applied our instrument once on Xi and once on X 2 , obtaining 
the readings X 4 = 1 and Xq = 2, respectively. Here is how our knowledge is updated: 

> ( mu.c <- c(rep(NA, 3), 1, NA, 2, NA) ) # conditioning 

[1] NA NA NA 1 NA 2 NA 

> out <- norm.mult.cond(mu, V, mu.c) 

> round( out$mu, 4) 

[1] 0.9613 1.9514 0.0291 1.0000 0.9904 2.0000 1.9805 

> round( out$V, 4) 



[,1] 

[,2] 

[,3] 

[,4] 

[,5] 

[,6] 

[,7] 

[1,] 

1.9514 

0.9613 

-0.9709 

0 

0.9805 

0 

-0.0096 

[2,] 

0.9613 

1.9514 

-0.9709 

0 

-0.0096 

0 

0.9805 

[3,] 

-0.9709 

-0.9709 

0.9806 

0 

0.0097 

0 

0.0097 

[4,] 

0.0000 

0.0000 

0.0000 

0 

0.0000 

0 

0.0000 

[5,] 

0.9805 

-0.0096 

0.0097 

0 

1.9902 

0 

0.0001 

[6,] 

0.0000 

0.0000 

0.0000 

0 

0.0000 

0 

0.0000 
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0 1.9902 


[7,] -0.0096 0.9805 0.0097 0 0.0001 

> round( out.s <- sqrt(diag(out$V)), 4 ) 

[1] 1.3969 1.3969 0.9902 0.0000 1.4107 0.0000 1.4107 


> roundC out$V / outer(out.s, out.s), 3) 




[,1] 


[,2] 


[,3] 

[,4] 


[,5] 

[,6] 


[,7] 

[1,] 

1 

.000 

0 

.493 

-0 

.702 

NaN 

0 

.498 

NaN 

-0 

.005 

[2,1 

0 

.493 

1 

.000 

-0 

.702 

NaN 

-0 

.005 

NaN 

0 

.498 

[3,] 

-0 

.702 

-0 

.702 

1 

.000 

NaN 

0 

.007 

NaN 

0 

.007 

[4,1 


NaN 


NaN 


NaN 

NaN 


NaN 

NaN 


NaN 

[5,] 

0 

.498 

-0 

.005 

0 

.007 

NaN 

1 

.000 

NaN 

0 

.000 

[6,] 


NaN 


NaN 


NaN 

NaN 


NaN 

NaN 


NaN 

[7,1 

-0 

.005 

0 

.498 

0 

.007 

NaN 

0 

.000 

NaN 

1 

.000 


As expected, X 4 = 1 sets essentially to 1 the true value Xi and the ‘future’ - or not yet 
known! - reading X^. Similarly, Xq sets essentially to 2 X 2 and Xj. (The difference from 
the exact value of 1 and 2, respectively, is due - let us repeat it once again - to the fact 
that we use, for didactic purposes, initial standard uncertainties ui and 1 T 2 ‘relatively small’, 
while the uncertainty on the common offset is ‘relatively large’.) The most interest part of 
the result is the 3x3 upper left part of the resulting correlation matrix, which we repeat 
here: 

/ 1.000 0.493 -0.702 \ 

0.493 1.000 -0.702 

\ -0.702 -0.702 1.000 / 

As we have learned in the previous section, the value of the offset gets anticorrelated to the 
true values. Moreover the two true values get positively correlated, as expected: a part 
of our uncertainty on them is due the imprecise knowledge of the offset, which then affects 
both values in the same direction. 


8 Inferences and forecasting based on mean values 

Often our inferences and forecasting are based on averages, instead than on individual 
values. It is rather understood that in Gaussian samples the inference on the Gaussian ‘/r’ 
is the same if we use the mean rather than the detailed information, due to the so called 
property of ‘statistical sufficiency’. It is instead less clear what we should expect for a next 
mean, based on a sample of the same size of the first one. For example, very often one ears 
and read^il something like “if we have got a mean (xp) and then imagine to repeat a large 
number of independent samples of the same size (n) of the ‘past’ one, then we expect about 
in the interval Xp ± a /^/n ”, that is 

Pi'^P <Xf<Xp + ^) = 68%, (87) 


For example, we read in Ref. [B] (pp. 118-119) 

“In reporting the measurement of 9 as 9obs ± &§ one means that repeated estimates all based 
on n observations of x would be distributed according to a p.d.f. g{9) centered around some 
true value 9 and true standard deviation ag, which are estimated to be 9oba and Og.” Mistakes 
of this kind are due to a curious ideology that refuses to make probabilistic statements about 
uncertain values, in contrast to the physicist’ intuition (see extended discussions in Ref. [^, 
with hints on the way of reasoning of Gauss and Laplace). 
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0 0 


Figure 10: Past mean and future mean. Xi is the value of the quantity, X 2 -X 4 the ob¬ 
servations in the first samples (i.e. oi, 02 and 03 ), X 5 -X 7 the observations in the second 
samples (i.e. 04 , 05 and 05 ). Xs and Xq are the sample means. 


a statement wrong by a factor \j\f2 in the size of the interval (or, equivalently, about 
30% in the value of expected frequency, that should be 52%) as we shall see in a while (see 
also Ref. 0 ). In order to do this, let us play with our tool, building the minimal model to 
observe the effect of interest. 

Figure [To] shows the value of a quantity (Xi) and two samples, each of three observations 
(X 2 -X 4 and X 5 -X 7 ), whose mean values are Xg and Xg (the dashed arrows indicate that 
the links are deterministic instead than probabilistic, since an arithmetic mean is univocally 
determined by the values of the sample). We only need to write down the transformation 
rules to get Xg and Xg in order to build up the extra rows of the transformation matrix. 
They are: 


Xg = — (X2 -|- X3 -|- X4) = Xi — (ei -|- 62 -t- 63) 
Xg = — (X5 -|- Xg -|- X7) = Xi -|- — (64 -1-65-1- 60) 
from which it follows 


C 


0 0 0 0 

110 0 
10 10 
10 0 1 

10 0 0 

10 0 0 

10 0 0 

1 1/3 1/3 1/3 

V1 0 0 0 


0 0 0 \ 

0 0 0 

0 0 0 

0 0 0 

10 0 
0 1 0 

0 0 1 

0 0 0 

1/3 1/3 1/3 / 


( 88 ) 

(89) 


(90) 


Here is our implementation in R, where we have now used cii = 100, much larger than 
the ‘experimental resolution’ of 1. This choice makes, for the numerical values of the 
observations we shall use, the prior on Xi practically irrelevant (the effect on the expected 
values is of the order 10 “^) so that we can better focus on other effects: 
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# parameters 

# expected values 


> n <- 7; muXl <- 0; sigmaXl <- 100; sigma.i <- 1 

> mu <- repCmuXl, n + 2) 

> ( sigma <- cCsigmaXl, repCsigma.i,n-l)) ) 

> VO <- matrix(rep(0, n*n), c(n,n)) # initial diagonal covariance matrix 

> diag(VO) <- sigma~2 

> C <- matrix(rep(0, n*n), c(n,n)) transformation matrix 

> diag(C) <- repd, n) 

> C[,l] <- repCl, n) 

> C <- rbind( C, c(l, rep(l/3, 3), rep(0, 3)), c(l, rep(0, 3), rep(l/3, 3)) ) 

> round(C,3) 



[,1] 


[,2] 


[,3] 


[,4] 


[,5] 


[,6] 


[,r] 

[1,] 

1 

0 

.000 

0 

.000 

0 

.000 

0 

.000 

0 

.000 

0 

.000 

[2,] 

1 

1 

.000 

0 

.000 

0 

.000 

0 

.000 

0 

.000 

0 

.000 

[3,] 

1 

0 

.000 

1 

.000 

0 

.000 

0 

.000 

0 

.000 

0 

.000 

[4,] 

1 

0 

.000 

0 

.000 

1 

.000 

0 

.000 

0 

.000 

0 

.000 

[5,] 

1 

0 

.000 

0 

.000 

0 

.000 

1 

.000 

0 

.000 

0 

.000 

[6,] 

1 

0 

.000 

0 

.000 

0 

.000 

0 

.000 

1 

.000 

0 

.000 

[7,] 

1 

0 

.000 

0 

.000 

0 

.000 

0 

.000 

0 

.000 

1 

.000 

[8,] 

1 

0 

.333 

0 

.333 

0 

.333 

0 

.000 

0 

.000 

0 

.000 

[9,] 

1 

0 

.000 

0 

.000 

0 

.000 

0 

.333 

0 

.333 

0 

.333 


> V <- C 7,*7, VO 7.*7. t(C) # joint covariance matrix 

> ( su <- sqrt(diag(V)) ) # initial uncertainties 

[1] 100.0000 100.0050 100.0050 100.0050 100.0050 100.0050 100.0050 100.0017 
[9] 100.0017 

> ( m <- dim(V)[1] ) # number of transformed variables 

[1] 9 

> roundC V/outer(su,su), 5) 

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] 

[1,] 1.00000 0.99995 0.99995 0.99995 0.99995 0.99995 0.99995 0.99998 0.99998 

[2,] 0.99995 1.00000 0.99990 0.99990 0.99990 0.99990 0.99990 0.99997 0.99993 

[3,] 0.99995 0.99990 1.00000 0.99990 0.99990 0.99990 0.99990 0.99997 0.99993 

[4,] 0.99995 0.99990 0.99990 1.00000 0.99990 0.99990 0.99990 0.99997 0.99993 

[5,] 0.99995 0.99990 0.99990 0.99990 1.00000 0.99990 0.99990 0.99993 0.99997 

[6,] 0.99995 0.99990 0.99990 0.99990 0.99990 1.00000 0.99990 0.99993 0.99997 

[7,] 0.99995 0.99990 0.99990 0.99990 0.99990 0.99990 1.00000 0.99993 0.99997 

[8,] 0.99998 0.99997 0.99997 0.99997 0.99993 0.99993 0.99993 1.00000 0.99997 

[9,] 0.99998 0.99993 0.99993 0.99993 0.99997 0.99997 0.99997 0.99997 1.00000 

As we see, all quantities are now highly correlated. In particular, it is interesting to see 
how Ag and Xg are correlated with Xi, with any observation of the first sample (A 2 -A 4 ) 
and with any observation of the second sample (A 5 -A 7 ). 

8.1 Expectations for a given value of Xi 

Let us now fix Xi at our usual value of 2: 

> ( mu.c <- c(2, rep(NA, m-1)) ) # XI = 2 

[1] 2 NA NA NA NA NA NA NA NA 

> out <- norm.mult.cond(mu, V, mu.c) 

> out$mu 

[ 1 ] 222222222 

> round( out.s <- sqrt(diag(out$V)), 4 ) 

[1] 0.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.5774 0.5774 
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> roundC out$V / outer(out.s, out.s), 



[,1] 


[1.] 

NaN 


[2,1 

NaN 

1 

[3,] 

NaN 

0 

[4,1 

NaN 

0 

[5,] 

NaN 

0 

[6,] 

NaN 

0 

[7,1 

NaN 

0 

[8,] 

NaN 

0 

[9,] 

NaN 

0 


[,2] [,3] 

NaN NaN 
0000 0.0000 
0000 1.0000 
0000 0.0000 
0000 0.0000 
0000 0.0000 
0000 0.0000 
5774 0.5774 
0000 0.0000 


[,4] [,5] 

NaN NaN 
0.0000 0.0000 
0.0000 0.0000 
1.0000 0.0000 
0.0000 1.0000 
0.0000 0.0000 
0.0000 0.0000 
0.5774 0.0000 
0.0000 0.5774 


4) 

[, 6 ] 

NaN 

0.0000 

0.0000 

0.0000 

0.0000 

1.0000 

0.0000 

0.0000 

0.5774 


[.7] 

NaN 

0.0000 

0.0000 

0.0000 

0.0000 

0.0000 

1.0000 

0.0000 

0.5774 


[, 8 ] 
NaN 
0.5774 
0.5774 
0.5774 
0.0000 
0.0000 
0.0000 
1.0000 
0.0000 


[,9] 

NaN 

0.0000 

0.0000 

0.0000 

0.5774 

0.5774 

0.5774 

0.0000 

1.0000 


As we already know, all observations become conditionally independent with expected value 
2 and standard uncertainty 1. The averages are also expected to be around 2, but with 
smaller uncertainty, namely l/\/3 ~ 0.5774 And, obviously, the averages are only correlated 
with each observation of their own sample. For example, for X 2 and Xg, we have, starting 
from the transformation rules (IM]l and (f88]l . we get, being Xi certain. 


and then 


Cov[A2,X8] 


1 

3 

1 

3 


X 

X 1 « 0.33, 


p[X2,Xs] 


Cov[X 2 ,X 8 ] 
(t[X 2] • ^[As] 

1 X 1/v^ 


0.5774, 


(91) 

(92) 


(93) 

(94) 


that is exactly what we can read in the R output. 


8.2 Reconditioning on the valne of the first mean 


Let us now see what happens if we get informed about a mean value, e.g. Ag = 2. 


> ( mu.c <- c(rep(NA, m-2), 2, NA) ) # first mean = 2 

[1] NA NA NA NA NA NA NA 2 NA 

> out <- norm.mult.cond(mu, V, mu.c) 

> round(out$mu, 5) 

[1] 1.99993 2.00000 2.00000 2.00000 1.99993 1.99993 1.99993 2.00000 1.99993 


> round( out.s <- sqrt(diag(out$V)), 4) 

[1] 0.5773 0.8165 0.8165 0.8165 1.1547 1.1547 1.1547 0.0000 0.8165 


> round(out$V, 4) 



[,1] 

[,2] 

[,3] 

[1.] 

0.3333 

0.0000 

0.0000 

[2,1 

0.0000 

0.6667 

-0.3333 

[3,] 

0.0000 

-0.3333 

0.6667 

[4,1 

0.0000 

-0.3333 

-0.3333 

[5,] 

0.3333 

0.0000 

0.0000 

[6,] 

0.3333 

0.0000 

0.0000 

[7,1 

0.3333 

0.0000 

0.0000 

[8,] 

0.0000 

0.0000 

0.0000 

[9,] 

0.3333 

0.0000 

0.0000 


> round( out$V / outer(out 


[,4] 
0.0000 
-0.3333 
-0.3333 
0.6667 
0.0000 
0.0000 
0.0000 
0.0000 
0.0000 
s, out. 


[,5] 
0.3333 
0.0000 
0.0000 
0.0000 
1.3333 
0.3333 
0.3333 
0.0000 
0.6667 
;), 4) 


[, 6 ] 

0.3333 

0.0000 

0.0000 

0.0000 

0.3333 

1.3333 

0.3333 

0.0000 

0.6667 


[,7] [,8] 
0.3333 0 
0.0000 0 
0.0000 0 
0.0000 0 
0.3333 0 
0.3333 0 
1.3333 0 
0.0000 0 
0.6667 0 


[,9] 

0.3333 

0.0000 

0.0000 

0.0000 

0.6667 

0.6667 

0.6667 

0.0000 

0.6667 
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[,1] 

[,2] 

[,3] 

[,4] 

[,5] 

[,6] 

[,7] 

[,8] 

[,9] 

[1,] 

1.0000 

0.0 

0.0 

0.0 

0.5000 

0.5000 

0.5000 

NaN 

0.7071 

[2,] 

0.0000 

1.0 

-0.5 

-0.5 

0.0000 

0.0000 

0.0000 

NaN 

0.0000 

[3,] 

0.0000 

-0.5 

1.0 

-0.5 

0.0000 

0.0000 

0.0000 

NaN 

0.0000 

[4,] 

0.0000 

-0.5 

-0.5 

1.0 

0.0000 

0.0000 

0.0000 

NaN 

0.0000 

[5,] 

0.5000 

0.0 

0.0 

0.0 

1.0000 

0.2500 

0.2500 

NaN 

0.7071 

[6,] 

0.5000 

0.0 

0.0 

0.0 

0.2500 

1.0000 

0.2500 

NaN 

0.7071 

[7,] 

0.5000 

0.0 

0.0 

0.0 

0.2500 

0.2500 

1.0000 

NaN 

0.7071 

[8,] 

NaN 

NaN 

NaN 

NaN 

NaN 

NaN 

NaN 

NaN 

NaN 

[9,] 

0.7071 

0.0 

0.0 

0.0 

0.7071 

0.7071 

0.7071 

NaN 

1.0000 


The knowledge about the first average constrains Xi to 2.00±0.58, that is xXaj\/n, while 
the expectations about the next average is 2.00 ± 0.82, that is x ± The future 

observations are instead expected to be 2.00 ± 1.15, where the standard uncertainty comes 
from Vo.577^ + 1^, quadratic combination of the uncertainty about X\ and that of any of 
the future observations around X\. 

And, as expected, there are correlations among all values which are still uncertain, with 
the exception of Xi with X 2 , X^ and X 4 (the observations of the first sample). This on a 
first sight is not very intuitive. The reason is that Xi is fully determined by the average Xg, 
and therefore our knowledge about it cannot change if we are informed about the individual 
values of the measurements, as we shall see in the next subsection. 

Remaining on the values of the first sample, their expected value is exactly 2, instead 
than 1.99993, a difference absolutely negligible in practice, but very interesting indeed to 
understand the flow of the probabilistic updates. Their values depend only on the average, 
and not on the prior about Xi. Their uncertainty is the same as the uncertainty on the 
future average {y/2 x 1/V3), although not easy to understand at an intuitive level. Easier 
to understand are their mutual anticorrelations, since their linear combination Xg (their 
mean value) is fixed. 

8.3 Knowing the average and one of the values that contribute to the 
first mean 

In order to better understand the role of the mean in the inference, let us assume we also 
know the value of one of the three observations contributing to it, for example X 2 = 1. 

> ( mu.c <- c(NA, 1, rep(NA, m-4), 2, NA) ) # first mean (X8) = 2; X2=l 

[1] NA 1 NA NA NA NA NA 2 NA 

> out <- norm.mult.cond(mu, V, mu.c, check=FALSE) 

> round(out$mu, 4) 

[1] 1.9999 1.0000 2.5000 2.5000 1.9999 1.9999 1.9999 2.0000 1.9999 

> round( out.s <- sqrt(diag(out$V)), 4 ) 

[1] 0.5773 0.0000 0.7071 0.7071 1.1547 1.1547 1.1547 0.0000 0.8165 


round(out$V, 4) 
[,1] [,2] 

[,3] 

[,4] 

[,5] 

[,6] 

[,7] 

[,8] 

[,9] 

[1,] 

0.3333 

0 

0.0 

0.0 

0.3333 

0.3333 

0.3333 

0 

0.3333 

[2,] 

0.0000 

0 

0.0 

0.0 

0.0000 

0.0000 

0.0000 

0 

0.0000 

[3,] 

0.0000 

0 

0.5 

-0.5 

0.0000 

0.0000 

0.0000 

0 

0.0000 

[4,] 

0.0000 

0 

-0.5 

0.5 

0.0000 

0.0000 

0.0000 

0 

0.0000 

[5,] 

0.3333 

0 

0.0 

0.0 

1.3333 

0.3333 

0.3333 

0 

0.6667 

[6,] 

0.3333 

0 

0.0 

0.0 

0.3333 

1.3333 

0.3333 

0 

0.6667 

[7,] 

0.3333 

0 

0.0 

0.0 

0.3333 

0.3333 

1.3333 

0 

0.6667 

[8,] 

0.0000 

0 

0.0 

0.0 

0.0000 

0.0000 

0.0000 

0 

0.0000 

[9,] 

0.3333 

0 

0.0 

0.0 

0.6667 

0.6667 

0.6667 

0 

0.6667 
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> roundC out$V / outer(out.s, out.s), 4) 



[,1] 

[,2] 

[,3] 

[,4] 

[,5] 

[,6] 

[,7] 

[,8] 

[,9] 

[1,] 

1.0000 

NaN 

0 

0 

0.5000 

0.5000 

0.5000 

NaN 

0.7071 

[2,1 

NaN 

NaN 

NaN 

NaN 

NaN 

NaN 

NaN 

NaN 

NaN 

[3,] 

0.0000 

NaN 

1 

-1 

0.0000 

0.0000 

0.0000 

NaN 

0.0000 

[4,1 

0.0000 

NaN 

-1 

1 

0.0000 

0.0000 

0.0000 

NaN 

0.0000 

[5,] 

0.5000 

NaN 

0 

0 

1.0000 

0.2500 

0.2500 

NaN 

0.7071 

[6,] 

0.5000 

NaN 

0 

0 

0.2500 

1.0000 

0.2500 

NaN 

0.7071 

[7,1 

0.5000 

NaN 

0 

0 

0.2500 

0.2500 

1.0000 

NaN 

0.7071 

[8,] 

NaN 

NaN 

NaN 

NaN 

NaN 

NaN 

NaN 

NaN 

NaN 

[9,1 

0.7071 

NaN 

0 

0 

0.7071 

0.7071 

0.7071 

NaN 

1.0000 


As we can see, the inference about Xi does not change. As a consequence, also the expec¬ 
tations about the future observations are not affected by this extra piece of informations. 
Instead, we change our knowledge about X 3 and X 4 , whose expected values become 2.5, in 
order to compensate X 2 = 1 [i.e 2.5 = (3x2 — l)/2] and they are fully anticorrelated, as 
more or less expected. 


9 The effect of a constrain among true values 

Another important issue is how the knowledge that the some quantities are intrinsically 
correlated changes the inference. Cases of this kind happen when several quantities are 
related by a deterministic relation, and a well understood case is when measuring the 
internal angles of a triangles in a flat space. Just to focus on a numerical example, let us 
imagine the individual angles to be determined, starting from very vague priors as 


a 

= 58° ±2° 

(95) 

fi 

= 73° ±2° 

(96) 

7 

= 54° ±2°. 

(97) 


The measurements can be independent, as we have supposed (let us forget the case of 
measurements with common systematics in order to focus on the effect of the constrain), 
but nevertheless the relation a+fS+j = 180° will make the results correlated. The graphical 
model is represented in figure [ 11 ] with the extra node A 7 representing the sum of the angles 
and related to Ai, X2 and X 3 by deterministic links (dashed arrows). 



Figure 11: Inferring the internal angles of a triangle by independent measurements. 


42 





9.1 Exact solution in the case of identical resolution of the goniometer 
and neglecting systematic effects 

Being the case rather simple, especially if all uncertainties are equal, let us make the exercise 
of going through the exact solution. Indicating the angles all together with the variable 
X = {a, 13, 7 }, whose expected value is E[X] = {a = 58°, 6 = 73°, c = 54°}. The covariance 
matrix is diagonal with all terms equal to = (2°)^. We make then the transformation to 
Y = {a, j3, 7 , Sj, where S = a+/ 3 + 7 , and then condition on S = 180°. The transformation 
matrix is then 


C = 


from which we obtain 


Vy = 


(\ 0 0 \ 
0 1 0 
0 0 1 
VI 1 1/ 



(\ 0 0 \ 
0 1 0 
0 0 1 
VI 1 1/ 


0 0 1 


(98) 


10 1 ) = 

0 0 11 


0 
0 


a 

0 


0 

0 

^72 


\ 

C72 

^72 

2 >a^ J 


(99) 


Conditioning on S = 180°, that is = 180°, using Eqs. (l3^ and (HOjl . we get 



S=180° 



( 100 ) 


with A (/9 = (a + 6 + c) — 180°. In practice the resulting rule is the most naive one could 
imagine: subtract to each value one third of the excess of their sum above 180°. (If you 
think that this rule is to simplistic, the reason might be that your model of uncertainty 
in this kind of measurements is different than that used here, implying for example scale 
type errors. But this kind of errors are beyond the aim of this note, because they imply 
non-linear transformations.) 

This is the conditioned covariance matrix 


2u2 


1 - 1/2 - 1/2 
- 1/2 1 - 1/2 
- 1/2 - 1/2 1 

written in a form that highlights the correlation matrix. The result is finally 

A(^' 


( 101 ) 


a = 


a — 


a 


( 102 ) 


and similar expression for (3 and 7 , thus yielding 


a 

= 56.33° ± 1.63° 

(103) 

/3 

= 71.33° ± 1.63° 

(104) 

7 

= 52.33° ± 1.63° , 

(105) 


with p{a,P) = /9(a,7) = p(/3,7) = - 1 / 2 . 
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9.2 Numerical implementation in R 


In analogy of what we have previously done in several cases, we start from independent 
quantities ui, U 2 , v^, ei, eo and 63 . For the true values of the angle we choose a flat prior, 
modelled with a Gaussian^u of central value (all values in degrees) 60 and a = 1000. The 
expected values of the fluctuations of the observations around the true values are instead 
0 , with standard deviations equal to the experimental resolutions, called sigma.gonio in 
the code, so that it can be changed at wish. 

The transformation rules are 


Xf 

= Vl 

(106) 

^2 

= V 2 

(107) 

^3 

= V 3 

(108) 

X4 

= oi = ui + ei 

(109) 

^5 

= O 2 = U 2 + 62 

(110) 

^6 

= O 3 = ^3 + 63 

(111) 


= V 1 +V 2 + V 3 

(112) 


from which we get the transformation matrix 


C 


/ 1 0 0 0 0 0 \ 

0 1 0 0 0 0 

0 0 1 0 0 0 

10 0 10 0 
0 10 0 10 

0 0 1 0 0 1 

V 1 1 1 0 0 0 / 


(113) 


Here is the R code to calculate the expected values and covariance matrix of the three 
angles: 

mu.priors <- rep(60, 3); sigma.priors <- repClOOO, 3) # priors 

sigma.gonio <- c(2, 2, 2) # experimental resolutions 

m=6; muO <- c(mu.priors, rep(0, 3)) 
sigma <- c(sigma.priors, sigma.gonio) 

VO <- matrix(rep(0, m*m), c(m,m)) # diagonal matrix 

diag(VO) <- sigma"2 

C <- matrix(rep(0, m*m), c(m,m)) # tranformation matrix 

diag(C) <- 1 

ford in 1:3) C[3+i, i] <- 1 
C <- rbind(C, c( rep(l, 3), rep(0,3)) ) 


V <- C y,*"/, VO y*"/, t(C) # transformed matrix 

mu <- as.vector(C "/,*"/ muO) # expected values 


^®Let us remind that this does not imply we believe that the angles could be negative or larger than 
180°: it is just a trick to have a pdf that is practically flat between 0 and 180°. The trick allows us to use 
the normal multivariate formulae of reconditioning. Obviously, one has to check that the final results are 
consistent with our assumptions and that the tails of the Gaussian posterior distributions are harmless, as 
it is the case in our example. 
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out <- norm.mult.cond(mu, V, c(NA, NA, NA, 58, 73, 54, 180) ) 
angles <- marginal.norm(out$mu, out$V, rep(l,3)) 

And these are, finally, the results, shown as an R session: 

> angles$mu 

[1] 56.33335 71.33329 52.33336 

> ( sigma.angles <- sqrt(diag(angles$V)) ) 

[1] 1.63299 1.63299 1.63299 

> ( corr <- angles$V / outer(sigma.angles, sigma.angles) ) 

[,1] [,2] [,3] 

[1,] 1.0000000 -0.4999976 -0.4999976 

[2,] -0.4999976 1.0000000 -0.4999976 

[3,] -0.4999976 -0.4999976 1.0000000 

As we see, we get the same results obtained above, with the advantage that we can now 
change the experimental resolutions of the individual measurement. Or we can modify the 
model, in order to include the effect of common systematic, though limited to offset type, 
exercise left to the reader. 


10 Fitting linear models to data, with ‘known’ standard de¬ 
viations on the y axes 

As a last example, let us see how simple fits can be described in terms of conditioned normal 
multivariates. ‘Simple’ does not mean here linear fits, because even a realistic model to fit 
a straight line through data points is not that ‘simple’, if we are interested to infer also 
the standard deviations(s) describing the errors and we consider errors on both axes (see 
Ref. [^, from which Fig. [ 12 ] has been taken). On the other hand, also fitting high order 
polynomials can be considered ‘simple’, under the same assumptions. 

The meaning of Fig. [ 12 ] is that for each data point we have three uncertain quantities: 
the true value of x (“/Ua,!’), the observed xi and the observed yi, while the true value pxi 





Figure 12 : Graphical representation of the model in term of a Bayesian network. 
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Figure 13: Linear fit to data point under usual simplifications (see text). 


is deterministically related to fixi and to the model parameters. So, for n data points the 
‘really’ dimensionality of the problem (i.e. not taking into account the is 3 x n + Up, 
where Up is the number of parameters. The inference on the parameters 6 is the performed 
conditioning on all x and y and marginalizing on /Tx- 

A usual simplification is to ignore the errors on the x values, making then deter¬ 
ministically on Xi and 0. Or, if we like, we can see each yi caused by the corresponding Xi 
and the set of parameters 0. Assuming a normal error distribution with known standard 
deviations, linear and quadratic models can be described as 


Yi ~ Af{c + mxi,ai) (linear) (114) 

Yi ~ Af{a + bxi + cxj, ai) (quadratic) (115) 

and then expanded to all possible models of the kind 


Yi 


JY{I3i gi{xi) +/32 g 2 ixi)-\ - , ai), 


(116) 


where ^lO, g 20 ^^4 so on are mathematical functions of Xi not containing free parameters!^ 
It is then rather clear that under these assumptions the problem can be treated using the 
properties of the multivariate normal distributions. 

The general model of Fig. [T2] becomes, for the first three data points, that of Fig. [T3l 
The variables of our problem are then, indicating them with Zi 


Zi = c (117) 

Z 2 = m (118) 

Z 3 = Ti = c-|-xi m + Cl (119) 

Zi = Y 2 = c + X 2 m + e 2 (120) 

Z 5 = Y 3 = c + X 3 m + 63 (121) 


■^^To make it even more clear, in the case of the quadratic model we have: / 3 i = a, /32 = 6 and jSs = c; 
gi{xi) = 1, g2{xi) = Xi and gaixi) = x}. 
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X 


Figure 14: Points to be fitted by a straight line. Our task is to infer the line parameters 
and to make previsions about future measurements at the x points indicated by the dashed 
vertical lines (0, 8 and 10). Note that no “uncertainty bars” have been drawn around the 
points, since the points are certain{\). What are uncertain are instead slope and intercept 
of the model. 


and so on. 

Our usual transformation matrix transformation for the case of three data points is 
ther 


C 


/I 0 0 0 0 \ 

0 10 0 0 

1 Xi 1 0 0 

1 X2 0 1 0 

Vl X3 0 0 1 / 


( 122 ) 


10.1 Numerical example with 5 ‘data points’ and 3 previsions 

As numeric example let us consider (see Fig. [T4|l the five x values x <- 2:6, in correspon¬ 
dence of which we have ‘observed’ the y values y <- c(7.0, 9.5, 11.8, 12.9, 14.8), 

■^^In the case of a parabolic fit we would have instead 


(1 

0 

0 

0 

0 


0 

1 

0 

0 

0 

0 

0 

0 

1 

0 

0 

0 

1 


xi 

1 

0 

0 

1 

X2 

xi 

0 

1 

0 

V 1 

X3 

xi 

0 

0 

1/ 


and so on for higher order polynomials. 
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in fact simulated by the command 

> y <- roundC 3 + 2*yi + rnorm(length(x), 0, 0.5), 1) 

Our true values of the parameters are indeed c = 3 and m = 2, while the standard devi¬ 
ations describing the errors e* are all equal to 0.5. Moreover we consider another three 
X values (0, 8 and 10), about which we are interested in making predictions. They are 
indicated in Fig. [131 by vertical dashed lines. 

Having set up the problem, here is how we construct the initial diagonal matrix in R, 
assigning very ‘uninformative priors’ to the fit parameters!^ 


X <- 2:6 # X (’predictors’) 

y <- c(7.0, 9.5, 11.8, 12.9, 14.8) # observed y 

x.f <- c(8, 10, 0) # X of new (’future’) measurements 


cm.priors <- c(0,0); sigma.priors <- c(100,100) # priors about c and m 

sy <- 0.5 # standard deviation of Y values 

n.points <- 8 # number of points (5 data + 3 predictions) 

muO <- c(cm.priors, rep(0,n.points)) 

sigma <- c(sigma.priors, rep(sy, n.points)) 

m <- n.points +2 # dimensionality of the problem (points + parameters) 

VO <- matrix(rep(0, m*m), c(m,m)) # diagonal matrix 

diag(VO) <- sigma''2 

Then we build up the transformation matrix C and calculate the covariance matrix of the 
ten quantities of the problem (2 parameters, 5 data points and 3 points about which we 
want to make predictions): 

C <- matrix(rep(0, m*m), c(m,m)) # tranformation matrix 

diag(C) <- 1 

C[3:m, 1] <- 1 

C[3:m, 2] <- c(x, x.f) 


V <- C °/o*°/o VO %*% t(C) # transformed matrix 

mu <- as.vector(C muO) # expected values 

Here are the quantities of interest 


> sigma 

[1] le+03 le+03 5e-01 5e-01 5e-01 5e-01 5e-01 5e-01 5e-01 5e-01 


> C 


[ 1 ,] 

[ 2 ,] 

[3,] 

[4,] 

[5,] 


[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] 


1 

0 

1 

1 

1 


0 0 

1 0 

2 1 

3 0 

4 0 


0 

0 

0 

1 

0 


0 

0 

0 

0 

0 


0 

0 

0 

0 

0 


0 0 
0 0 
0 0 
0 0 
0 0 


0 

0 

0 

0 

0 


^^The priors of the numerical examples are c = 0 ± 100 and m = 0 ± 100, uncorrelated. Not that if 
the standard deviations of the priors are ‘quite large’ then numerical instabilities arise because the results 
depend on the sum of very large numbers with small ones (the most sensitive of the two is ao(m) which 
starts to create problems above 600, while cro(c)) is quite harmful up to more than 2000. 
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[6,] 

1 

5 0 

0 

0 

1 0 

0 

0 

0 



[7,] 

1 

6 0 

0 

0 

0 1 

0 

0 

0 



[8,] 

1 

8 0 

0 

0 

0 0 

1 

0 

0 



[9,] 

1 

10 0 

0 

0 

0 0 

0 

1 

0 



[10,] 

1 

0 0 

0 

0 

0 0 

0 

0 

1 



> mu 











[1] 

0 0 0 0 

0 0 0 0 

0 0 








> sigma.V <- 

sqrt(diag(V)) 








> round( V /outer(sigma.V,: 

sigma.V] 

A 4 ) 







[,1] 

[,2] 

[,3] 

[,4] 

[,5] 

[,6] 

[.7] 

[,8] 

[,9] 

[,10] 

[1,] 

1.0000 

0.0000 

0.4472 

0.3162 

0.2425 

0.1961 

0.1644 

0.1240 

0.0995 

1.0000 

[2,] 

0.0000 

1.0000 

0.8944 

0.9487 

0.9701 

0.9806 

0.9864 

0.9923 

0.9950 

0.0000 

[3,] 

0.4472 

0.8944 

1.0000 

0.9899 

0.9762 

0.9648 

0.9558 

0.9430 

0.9345 

0.4472 

[4,] 

0.3162 

0.9487 

0.9899 

1.0000 

0.9971 

0.9923 

0.9878 

0.9806 

0.9754 

0.3162 

[5,] 

0.2425 

0.9701 

0.9762 

0.9971 

1.0000 

0.9989 

0.9968 

0.9927 

0.9895 

0.2425 

[6,] 

0.1961 

0.9806 

0.9648 

0.9923 

0.9989 

1.0000 

0.9995 

0.9973 

0.9952 

0.1961 

[7,] 

0.1644 

0.9864 

0.9558 

0.9878 

0.9968 

0.9995 

1.0000 

0.9992 

0.9979 

0.1644 

[8,] 

0.1240 

0.9923 

0.9430 

0.9806 

0.9927 

0.9973 

0.9992 

1.0000 

0.9997 

0.1240 

[9,] 

0.0995 

0.9950 

0.9345 

0.9754 

0.9895 

0.9952 

0.9979 

0.9997 

1.0000 

0.0995 

[10,] 

1.0000 

0.0000 

0.4472 

0.3162 

0.2425 

0.1961 

0.1644 

0.1240 

0.0995 

1.0000 


where the last output shows the initial correlation matrix. All variables are correlated, with 
some exceptions. In fact intercept and slope aren’t, as it should be, and the prediction at 
X = 0 (i.e. Zio) has zero correlation with the slope (its value is not influenced by the slope), 
while it is 100% correlated with the intercept. 

The inference on the model parameters is finally obtained conditioning on the observed 
values of y (this time we use the parameter full=FALSE to avoid large outputs! 

> ( out <- norm.mult.cond(mu, V, c(NA, NA, y, NA, NA, NA), full=FALSE ) ) 


$mu 





[1] 3.599857 1. 

900031 18.800107 22.600169 3.599857 

$V 





[,1] 

[,2] 

[,3] 

[,4] 

[,5] 

[1,] 0.44997876 

-0.09999521 

-0.34998295 

-0.5499734 

0.44997876 

[2,] -0.09999524 

0.02499899 

0.09999666 

0.1499946 

-0.09999524 

[3,] -0.34998318 

0.09999669 

0.69999034 

0.6499837 

-0.34998318 

[4,] -0.54997366 

0.14999467 

0.64998368 

1.1999730 

-0.54997366 

[5,] 0.44997876 

-0.09999521 

-0.34998295 

-0.5499734 

0.69997876 


from which we extract standard uncertainties and correlation coefficient: 

> ( sigmas <- sqrt( diag(out$V) ) ) 

As we can see from the output, the resulting covariance matrix is not exactly symmetrical, due to numeric 
effects. More stable results can be achieved replacing inside norm.mult.cond() the function solveO by 
chol2inv (chol (V22)), which makes used of the so called Choleski Decomposition. For example out$V [2,1] 
and out$V[l,2], respectively equal to —0.09999521 and —0.09999524, would become identical and equal to 
—0.09999498. Nevertheless since this check has been done only at this stage of the paper and being the 
result absolutely negligible, the original matrix inversion function solve () has been used also through all 
this section. 
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1] 0.6708046 0.1581107 0.8366543 1.0954328 0.8366473 
> ( corr <- out$V / outer(sigmas, sigmas) ) 



[.1] 

[,2] 

[,3] 

[,4] 

[,5] 

[1,] 

1.0000000 

-0.9428053 

-0.6235982 

-0.7484450 

0.8017770 

[2,] 

-0.9428055 

1.0000000 

0.7559242 

0.8660217 

-0.7559198 

[3,] 

-0.6235986 

0.7559244 

1.0000000 

0.7092032 

-0.4999870 

[4,] 

-0.7484454 

0.8660219 

0.7092032 

1.0000000 

-0.6000863 

[5,] 

0.8017770 

-0.7559195 

-0.4999867 

-0.6000860 

1.0000000 


Our resulting parametric inference on intercept and slope is then 


3.60 ±0.67 

(123) 

1.90 ±0.16 

(124) 

-0.94, 

(125) 


with the correlation coefficient far from being negligible, and in fact crucial when we want 
to evaluate other quantities that depend on c and m, as we shall see in a while. 

We can check our result, at least as far expectations are concerned, against what we 
obtain using the R function lm(), based on ‘least squares’ll 

> lm(y ~ x) 


Call: 

Im(formula = y ~ x) 


Coefficients: 

(Intercept) x 

3.6 1.9 

The data points, together with the best fit line and the intercept are reported in Fig. [T5j 
The expectations about the future measurements are instead 


Zs = y{x = 8) 

= 18.80 ±0.84 

(126) 

Zg = y{x = 10) 

= 22.60 ±1.10 

(127) 

^10 = y{x = 0) 

= 3.60 ±0.84, 

(128) 


with interesting correlations: 

• p[Z^, Zg] = 0.71, positive and quite high, because their are “on the same side” of the 
’experimental’ points and quite close to each other: due to the uncertainty about the 
slope they could be both smaller or larger than expected. 

• p[Z^,Ziq\ = —0.50, p[Zij,ZiQ\ = —0.60 negative for the opposite reason, and in abso¬ 
lute value increasing with the distance. 

Note how the uncertainty on Zg and Zio are the same, because the corresponding x values 
(8 and 0, respectively) are equally distant, from the barycenter the data along the x axis. 
Instead, (t(Zio) is different from a{c) because they are not the same thing{\): the uncertainty 
is a parameter of the model, while Zio = y{x = 0) is what we would measure at x = 0 on 
the base of the information provided by the previous measurements (and our assumptions 
about the model). 

■^^Under some conditions that usually hold in ‘routine’ applications, the ‘best estimates’ of the parameters 
turn to be practically equal to those obtained using probability theory (see e.g. [^.) 
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X 


Figure 15: Result of the linear fit, including the intercept and it uncertainty. The prediction 
of possible observations at x = 8 and x = 10 are also reported, avoiding instead that at 
X = 0 because it overlaps with the intercept. 

10.2 Uncertainty about /Xy(x) Vs uncertainty about y(x) 

The expected y’s at different values of x are simply the values c+mx calculated at different 
X, as it is easy to check 

> out$niu[l] + out$mu [2] *x. f 

[1] 18.800001 22.600001 3.599999 

More intriguing are the uncertainties. Indeed they get a contribution from the uncertainty 
of the true value yy(x) and that due to the experimental error around it. 

As far as the true values, in our simplified model they are given hy |J.y{xf^) = c + Xf^ m, 
which we can rewrite in matrix form as 

%(Ui) \ / ^ \ \ 

%(U2) = ^ ■ ( rn ) 

/ \ 1 Us / ^ ^ 

Here are then their expected values and covariance matrix directly in R 

> ( C.mu.f <- cbind(rep(l,3), x.f) ) 

x.f 

[ 1 ,] 1 8 
[ 2 ,] 1 10 
[3,] 1 0 
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> ( mu.f <- as.vector( C.mu.f out$mu[l:2] ) ) 

[1] 18.800107 22.600169 3.599857 

> ( V.mu.f <- C.mu.f ’/y/, out$V[1:2,1:2] t(C.mu.f) ) 

[,1] [,2] [,3] 

[1,] 0.4499903 0.6499837 -0.3499832 

[2,] 0.6499836 0.9499730 -0.5499737 

[3,] -0.3499829 -0.5499734 0.4499788 

> ( sigma.mu.f <- sqrt(diag(V.mu.f)) 

[1] 0.6708132 0.9746656 0.6708046 

We have then 

fj,y{x = 8 ) = 

Hy{x = 10) = 

Hy{x = 0 ) = 

with ^y{x = 0) exactly equal to the intercept. And again, the uncertainties on fjLy{x = 8) 
and iJ.y{x = 10) are the same, and then equal to ufclru 

The reason while the uncertainties about y{x) are larger than those of yy{x), for the 
same x, is also easy to understand. To the uncertainty about the true value we have to add 
that due to the experimental error. And in a linear model like ours the two contributions 
add in quadrature, as it easy to check 

while iiy{x = 8) and iiy{x = 10) have the standard uncertainty slightly smaller than 
those of the corresponding y{x = 8) and y{x = 10). To obtain these latter standard 
uncertainties it is enough to add quadratically the standard deviation of the experimental 
error: 


18.8 ±0.67 

(130) 

22.6 ±0.97 

(131) 

3.60 ±0.67, 

(132) 


> sqrt(sigma.mu.f“2 + sy“2) 

[1] 0.8366542 1.0954328 0.8366473 


The effect of the experimental errors is also to dilute the correlations, which among the 
true values are 


^^Under the conditions we are considering here, one can prove that 

{x-xf al 


[^y{x)\ = 


— + ■ 
n — X 


with the uncertainties depending on the absolute value of x — * and on the ‘lever arm’ of the experimental 
date (the larger is their ‘momentum of inertia’, that is x^ — x, the better is the determination of the slope 
and then more accurate the extrapolations. In our case this expression gives 

> var.x <- suin(x“2)/n - mean(x)“2 

> sqrt(sy“2/n + (x.f-mean(x))“2/var.x*sy“2/n ) 

[1] 0.6708204 0.9746794 0.6708204 

practically equal to the results got playing with covariance matrices. 

Note that above formula takes into account the correlation coefficient between c and m. Without it we 
would get 

> sqrt(sigmas[1]“2 + x.f“2*sigmas[2]“2) 

[1] 1.4317521 1.7175207 0.6708046 

with a[fj,y(x = 8)] and (j[^y{x = 10)] wrong by about a factor 2 (while a[fi.y{x = 0)] is right ‘by chance’, 
being equal to the intercept). (The slight numeric difference at the 5th decimal digit is due to the effect of 
the prior, not taken into account in the the above formula.) 
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> V.mu.f / outer(sigma.mu.f, sigma.mu.f) 



[.1] 

[,2] [,3] 

[1,] 

1.0000000 

0.9941347 -0.7777671 

[2,] 

0.9941347 

1.0000000 -0.8411826 

[3,] 

-0.7777666 

-0.8411821 1.0000000 

10.3 

Follows up 


Also in this case one do several other instructive test, which we read to the reader. Here is 
a partial list. 

• Impose a precise value for the intercept and the slope, to see how the other parameter 
changes. This can be done, for example for the intercept (c = 3) with the following 
conditioning: 

> out <- norm.mult.cond(mu, V, c(3, NA, y, NA, NA, NA), full=FALSE) 
(Quick test: what would you expect for ZiqI) 

• Do the same test, but using the previous out$V and out$mu. 

• Use some informative priors for c, m or both. 

• Make a new fit on another 5 data points generated from the same model, using as 
priors for c and m the result of the previous inference (including the correlation!). 

• Make a global fit on the 10 data points of the two datasets (starting from uninforma¬ 
tive priors) and compare with the result of the two inferences in sequence. 

Then is the question of estimating the common standard deviation of the model from the 
data. As told above, this cannot be done with the tools we are playing in this paper because 
the problem is not linear. Certainly a rough estimate can be done by the residuals, but if 
the number of data points is ‘small’ the uncertainty on the estimated sigma do not only 
affect this parameter, but also the joint pdf of c and m, which is longer normal bivariate 
(with consequences on the pdf’s of the previsions). The problem has to be solved using a 
model without short cuts and making the integrals numerically or by Markov Chain Monte 
Carlo, issues which are beyond the aim of this paper. (And pay attention to covariance 
matrices obtained by linearization! [8]) 


11 Propagation of evidence — some general remarks 

Let us take again the diagrams {‘‘graphs’) which describe two observations from the same 
true value and one observation resulting from a true value and a systematic effect. They 
are show again in Fig. flHl labelled with names related to the direction of the ‘causation’ 
arrows, which diverge from a single node or converge towards a single node. The physical 
interpretation is that, as we have already seen, of a single cause producing two effects, or 
two causes responsible of a single effect, respectively. Below each graph we have also added 
the covariance matrix which characterize it, where (T 2 |i = cj[A 2 |jfj], and so on. 

For completeness we have added in the figure also graph in which the effect X 2 is itself 
cause of another effect {serial connection). Sticking to the simple linear models we are 
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Figure 16: Basics causal connections among nodes of a belief network. 

dealing with, the transformation rules of the graph characterize by a serial connection are 
the following: 



= Vl 

(133) 

X2 

= V 2 = Vi + ei 

(134) 

X3 

= U 2 + 62 = Ui + ei + 62 

(135) 


from which the joint covariance matrix reported below the diagram follows, with a 2 \i = 
and CJ 312 = = ^ 62 - 

Analyzing the covariance matrix of the graphs with divergent and serial connections 
we see that the variables are fully correlated: any evidence on any of the three variables 
changes the pdf of the other two. 

Instead, in the convergent graph Xi and X 2 are independent. Indeed, why should the 
physical quantity we are going to measure should depend on a calibration constant of our 
detector? And the other way aroundl^ But we have already seen in the examples that if 
we observe X^, then Xi and X 2 become anticorrelated. 

The effect of the propagation of a condition {^instantiation') of one variable to the rest 
of the network is very interesting also for its practical applications, because it allows to 
decompose a large network in subnetworks. 

11.1 Diverging connection 

We have already seen in the numerical examples of subsection 15.11 that if we condition on a 
value of Xi, then X 2 and X 3 become independent, and the physical reason was very easy 
to be understood. This is a general property of divergent graphs, usually stated referring 

^®In reality this is not impossible, but definitely unusnal 
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Figure 17: Divergent connection with ‘evidence’ (indicated by the symbol ‘i/’) got in some 
of the variables (‘instantiated nodes’). The dashed arrows show the ‘flow of evidence’, i.e. 
how the information flows in the ‘network’. 


to parents and children: in a divergent graph, if a parent is instantiated, the children 
become independent, i.e. evidence does not flow any longer from one child to the other (‘an 
instantiated parent blocks evidence flow among children’ - we assume that there is no other 
connection among them!). The possible flows of evidence are reported in figure fTTl 

Let us make the exercise to calculate the covariance matrix of X 2 and X 3 given Xi. 
To use Eq. [JO] we need to rewrite the three variables in a compact form, thus defining 
Yi = {Xi} and Y 2 = {X 2 , -T 3 }. In this case it is convenient to rewrite (jin]) swapping the 
indices, thus obtaining: 


with 


It follows 


V 



V 22 - V 21 V 12 , 



(136) 


(137) 

(138) 

(139) 

(140) 

(141) 

(142) 
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Figure 18: As Fig. [T3 for a converging connection. 


and hence 


V 






+ cr: 


3|1 


(143) 


As expected, the exercise shows that and X 3 have become conditionally independent. 


11.2 Converging connection 

Instead, in the case of a converging connection, the parents are initially independent. That 
is the evidence on either parent, or on both, can influence the child(s), but cannot be 
transmitted from one parent to the other, as depicted in the graphs of figure [181 Let us 
make the exercise of instantiating the child, X 3 . 

In this case the convenient partition is = {Xi,X 2 } and Y 2 = {AI 3 }, and the 
conditional covariance matrix is obtained applying directly Eq. (|40ll : 


with 


V 



Vii - V12 V21 , 


(144) 


Vll 

II 

(145) 

V12 

b b 

II 

(146) 

V22 

— af + CT2 + 0-311 2 

(147) 

V22 

= (^crf + 0-2 + 0 - 1 ^ 2 ) 

(148) 

V21 

b 

b 

II 

(149) 
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Figure 19: As Fig. [T7] for a serial connection. 


It follows 

Vi2 V 21 


and hence 



erf + us + fj. 


ur 


3 | 1,2 



1 

uf + ul + u 2 ^ 2 


/ uf uful 

Verful u| 

(150) 





/ ^1-{^2+^3|1.2) 

'^1+^2+^IiU 


frf+o-^+cr||i 2 
+^ 311 , 2 ) 

<rf+<ri+<r^|l_2 / 


(151) 


As expected, the exercise shows that Xj and X 2 become anticorrelated, although the cor¬ 
relation coefficient has not a simple intuitive explanation. 


11.3 Serial connection 

Let us repeat the exercise for the serial connection, depicted in figure [T9j The convenient 
partition is now Yi = {Ai, A 3 } and A 2 = {A 2 }. And these are the details 


v„. r' ] 

T/" _ _2 I _2 

V 22 — 0-1+ 0^211 
^ 22 ^ = (^1 + 

F 2 I = 

57 


(152) 

(153) 

(154) 

(155) 

(156) 
























It follows 


Vu V21 


erf + cr,^ 


2 | 1 , 


erf + (T? 


2|1 


(Tt 


erf + erf 


er7 


2|1 

af • (erf + 


(jf + 0-2 


2|1 




and hence 


F 



yf+cr: 


2 

2|1 

~2— 

2|1 


0 

2 

'^ 3|2 


( 157 ) 


(158) 


As expected, the exercise shows that Xi and X 3 become now independent and the uncer¬ 
tainty about A 3 is simply (T 3 | 2 . And also in cr[Ai|jj^^] we recognize a familiar pattern (see 
also footnote HB- 

1 




+ 


cr, 


2|1 


(159) 
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